SOLVED

Extracting data from a PDF page

Pranab_C
8 - Asteroid

Hi Alteryx Champions, I am trying to extract data from a PDF file. I have been able to extract all the other pages using R code, but the attached page is not coming through properly. Any help with a RegEx or Formula tool to get this information would be great. To be more specific, I want all the locations in one column and the workdays in a second column. I am not so worried about the other columns. I need location, workdays, and COVID workdays.

alexnajm
17 - Castor

Go ahead and share your workflow too so we can see how the data is coming through and help from there! The best way is to export the workflow from the Options menu.

Raj
15 - Aurora

Do you have Alteryx Intelligence Suite?

alexnajm
17 - Castor

@Raj, @Pranab_C is saying that they are using R code, which makes me think the PDF Input tool from the Gallery is being used.

Raj
15 - Aurora

I think I missed it.
Thanks for bringing it to my attention.

Pranab_C
8 - Asteroid

This is the code that I have. I am unable to extract the table information properly.

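library('pdftools')
library('tibble')
library('dplyr')

# Read the incoming file path from Alteryx input #1
data <- read.Alteryx("#1", mode="data.frame")
pdf_file <- file.path(data$FullPath)

# pdf_data() returns one data frame of words (with x/y positions) per page
txtdata <- pdf_data(pdf_file)

# Tag each page's words with its page number, then stack all pages
output <- txtdata[[1]] %>% add_column(page = 1, .before = "width")

if(length(txtdata)>1){
for(i in 2:length(txtdata)){
    page_words <- txtdata[[i]] %>% add_column(page = i, .before = "width")
    output <- bind_rows(output, page_words)
}
}

# Send the combined data to output anchor 3
write.Alteryx(data.frame(output), 3)
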
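
Since pdf_data() returns one row per word with x and y positions, one possible next step is to stitch the words back into printed lines before applying any regex. The sketch below is only an illustration: the `words` data frame is a made-up stand-in for the combined output above, and it assumes that words on the same printed line share exactly the same y value, which may not hold for every PDF.

```r
library(dplyr)

# Hypothetical stand-in for the combined pdf_data() output (words and coordinates are invented)
words <- data.frame(
  page = c(1, 1, 1, 1, 1, 1),
  x    = c(200, 50, 90, 50, 200, 120),
  y    = c(100, 100, 100, 120, 120, 120),
  text = c("12", "New", "York", "Los", "30", "Angeles"),
  stringsAsFactors = FALSE
)

# Assumption: words with the same page and y belong to one printed line.
# Sort top to bottom, then left to right, and paste each line's words together.
text_lines <- words %>%
  arrange(page, y, x) %>%
  group_by(page, y) %>%
  summarise(line = paste(text, collapse = " "), .groups = "drop")

print(text_lines$line)
```

If the y values on a real page differ by a pixel or two within one line, rounding y to a coarser grid before grouping may be needed.
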
alexnajm
17 - Castor

@Pranab_C as mentioned, it would be best to have your workflow so we can see what that code is producing. That way, we can suggest the right formula / regex on the data

Pranab_C
8 - Asteroid

Attaching the sample PDF file and the workflow package. These are just two pages from a typical file, but the data will always be in the same format.

alexnajm
17 - Castor

I am sorry, but I am not an R expert, so I can't directly fix your code, and I am not sure what your output is supposed to be. It does run, but it's different from what I'm used to. I use this tool, which also relies on R code, and perhaps you could use it in this case: PDF Input - Alteryx Community. I tried it and it read the data in better than the approach in the .yxzp.

Then you can use Alteryx afterwards to parse out the parts you need. I can try to help with this, but only if it’s the path you want to go down. Good luck!
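
As a rough idea of what that parsing step could look like, here is a minimal R sketch that splits a text line into location, workdays, and COVID workdays with a regular expression. The sample lines and the layout (location text followed by two whole numbers) are assumptions made for illustration, not taken from the attached PDF, so the pattern would need adjusting to the real page.

```r
# Hypothetical lines; the real page layout may differ
text_lines <- c("New York 12 3", "Los Angeles 30 0", "Total hours report")

# Assumed layout: location text, then workdays, then COVID workdays (both whole numbers)
pattern <- "^(.+?)\\s+(\\d+)\\s+(\\d+)$"

# Keep only the lines that fit the assumed layout
matched <- text_lines[grepl(pattern, text_lines, perl = TRUE)]

# Pull each capture group into its own column
parsed <- data.frame(
  location       = sub(pattern, "\\1", matched, perl = TRUE),
  workdays       = as.integer(sub(pattern, "\\2", matched, perl = TRUE)),
  covid_workdays = as.integer(sub(pattern, "\\3", matched, perl = TRUE)),
  stringsAsFactors = FALSE
)

print(parsed)
```

A similar pattern should carry over to the RegEx tool in Parse mode, though that is untested here.
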

AndrewDMerrill
13 - Pulsar

Using the structure you already had, here is a workflow (not the prettiest) that parses that PDF.
