Alteryx Designer Desktop Discussions

Pranab_C · ‎12-24-2023

Hi Alteryx Champions, I am trying to extract data from a PDF file. I have been able to extract all the other pages using R code, but the attached page is not coming through properly. Any help with a RegEx or Formula tool to get this information would be great. To be more specific, I want all the locations in one column and the workdays in a second column. I am not so worried about the other columns. I need location, workdays, and COVID workdays.

alexnajm · ‎12-24-2023

Go ahead and give us your workflow too so we can help out with however it’s coming through! The best way is to Export the workflow under Options

Raj · ‎12-27-2023

Do you have Alteryx Intelligence Suite?

alexnajm · ‎12-27-2023

@Raj, @Pranab_C is saying that they are using R code, which makes me think the PDF input tool from the gallery is being used

Raj · ‎12-27-2023

I think I missed it.
Thanks for bringing it to my attention.

Pranab_C · ‎12-27-2023

This is the code that I have. I am unable to extract the table information properly.

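library('pdftools')
library('tibble')
library('dplyr')

# Read the incoming Alteryx stream; FullPath holds the PDF path
data <- read.Alteryx("#1", mode="data.frame")
pdf_file <- file.path(data$FullPath)
# pdf_data() returns one data frame per page, with one row per word
txtdata <- pdf_data(pdf_file)

# Tag each page's rows with its page number, then stack all pages
output <- txtdata[[1]] %>% add_column(page = 1, .before = "width")

if(length(txtdata)>1){
for(i in 2:length(txtdata)){
    # Use a separate name so the input stream in `data` is not overwritten
    page_data <- txtdata[[i]] %>% add_column(page = i, .before = "width")
    output <- bind_rows(output,page_data)
}
}

# Send the combined word-level table to output anchor 3
write.Alteryx(data.frame(output), 3)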
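
For context, pdf_data() returns one row per word with its x and y position, so a table arrives as scattered words rather than rows. A minimal sketch of one way to rebuild text lines before applying a RegEx or Formula tool is shown below. The helper name words_to_lines, the tolerance tol, and the sample words are all hypothetical, and the sketch assumes that words on the same printed row share roughly the same y value, which may not hold for every PDF.

```r
library(dplyr)

# Hypothetical helper: collapse word-level rows (as produced by pdf_data()
# plus a `page` column) into one text line per printed row.
# Assumption: words on the same row have y values within `tol` of each other.
words_to_lines <- function(words, tol = 2) {
  words %>%
    mutate(line_y = round(y / tol) * tol) %>%   # bin nearby y values together
    group_by(page, line_y) %>%
    arrange(x, .by_group = TRUE) %>%            # left-to-right within a row
    summarise(line = paste(text, collapse = " "), .groups = "drop") %>%
    arrange(page, line_y)                       # top-to-bottom within a page
}

# Made-up sample in the same shape, deliberately out of reading order
sample_words <- data.frame(
  page = c(1, 1, 1, 1, 1),
  x    = c(120, 50, 80, 90, 50),
  y    = c(100, 100, 100, 114, 114),
  text = c("12", "New", "York", "8", "Boston")
)

text_lines <- words_to_lines(sample_words)
print(text_lines$line)
```

In the R tool, the same helper could be applied to the combined output before write.Alteryx, after which each line can be split into location and workday columns downstream. The right split pattern depends on how the actual table is laid out.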

alexnajm · ‎12-27-2023

@Pranab_C as mentioned, it would be best to have your workflow so we can see what that code is producing. That way, we can suggest the right formula / regex on the data

Pranab_C · ‎12-27-2023

Attaching the sample PDF file and the workflow package. These are just two pages from a typical file, but the data would be in the same format.

alexnajm · ‎12-28-2023

I am sorry, but I am not an R expert, so I can't directly edit your code, and I am not sure what your output is meant to be. It does run, but the output is different from what I’m used to. I use this tool, which runs R code, and perhaps you could use it in this case: PDF Input - Alteryx Community. I tried it, and it read the data in better than the approach in the .yxzp.

Then you can use Alteryx afterwards to parse out the parts you need. I can try to help with this, but only if it’s the path you want to go down. Good luck!

CoG · ‎12-28-2023

Utilizing the structure you had prior, here is a workflow (not the prettiest) that parses that pdf.
