Hi There,
I have recently got hold of the Alteryx BI Suite and am doing a POC on using a PDF as my source. I am attaching a sample PDF. From this PDF I need to read just 5 rows and load them into Excel. I tried using Image Template, converting the result to text, and then loading it into Excel. However, the markers on the PDF change with every file, so I would have to redo the annotations each time. A new file arrives every day, and I need to read exactly those 5 rows and load the data into Excel. Can anyone guide me on this? I also tried another approach, converting the file to rows and columns, but that workflow has been running for an hour. I am attaching my PDF and both workflows.
Note that I only want to read page 1, where 5 rows are highlighted in yellow. I need those 5 rows, along with the last row (Total By Business), in a single Excel file.
I am new to Alteryx and would like to explore more. I hope I get a solution soon.
Thanks in Advance
M
I also have the Intelligence Suite. I have not downloaded your workflows, but I did look at the PDF. I have found the new tools to be less than optimal if your target text within the PDF shifts (even just a bit) and performance is not exactly speedy.
Do you have Acrobat Pro DC? And have you tried copy / paste (keep formatting) from the PDF to Excel? You could also try exporting from Acrobat to Excel. It might work, and it could be faster.
Hey Hellyars,
Thank you for taking the time to respond. I do not have Acrobat Pro DC, and I also want to automate the entire process. I will be getting a new PDF every day, so I need to read it daily, convert it to text, and then match it against the results of a SQL stored procedure.
It would be a good use case for us to show the capabilities of Alteryx. It would be good if we could read the PDF without annotations.
Thanks
Mansi C.
Hi @Mansi3
I don't have the BI suite, so I have to do it the old fashioned way!
There is a brilliant macro on the Alteryx Gallery that you can download called PDF Input. It relies on R, so you have to have Predictive Tools installed, and it also needs a little bit of setup, but it's straightforward.
Here's the link: https://gallery.alteryx.com/#!app/PDF-Input/5b685aff0462d710907f7a3b
And here's the documentation: https://til.bi/2Xee6mA
This will load all the PDFs in a folder that you specify. You can play around with the wildcard input to limit that. For starters I'd suggest putting just one file (your example file) in a folder and playing around with that.
The PDF Input will load each row of the PDF file as a single column of data, so then you have to parse it. Your file has a nice, predictable format, so this is not too difficult with Regex. If you're not too familiar with Regex, there are tons of posts and other resources on it here in the Community.
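To give a feel for the kind of parsing involved, here is a minimal sketch in plain Python (outside Alteryx, and not the attached workflow). It assumes each extracted line is a text label followed by numeric columns separated by two or more spaces, with thousands separators and parentheses for negatives. That layout, the sample line, and the names `parse_row` and `to_number` are all illustrative assumptions, so adjust the pattern to whatever your real file looks like.

```python
import re


def to_number(token):
    """Convert a token such as '1,234.50' or '(345.10)' to a float.

    Assumption: parentheses mark a negative value, commas are thousands separators.
    """
    token = token.strip()
    negative = token.startswith("(") and token.endswith(")")
    cleaned = token.strip("()").replace(",", "")
    value = float(cleaned)
    return -value if negative else value


def parse_row(line):
    """Split one extracted text line into (label, [numbers]).

    Assumption: columns are separated by runs of two or more spaces,
    so single spaces inside the label are left alone.
    """
    parts = re.split(r"\s{2,}", line.strip())
    return parts[0], [to_number(p) for p in parts[1:]]


# Hypothetical line, not taken from the attached PDF
label, values = parse_row("Total By Business    1,234.50   (345.10)   20")
print(label, values)
```

The same split-on-wide-gaps idea should carry over to the RegEx tool in Alteryx, as long as the label itself never contains a double space.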
I've written a workflow that reads your example file with the PDF Input macro, cleans it up, and extracts the 5 rows you want. You'll have to install PDF Input to get that part to work, but I've also copied the output from the PDF Input macro for your example file into a Text Input tool to show you how the rest of the workflow works.
Have a play with it and let me know if you have any questions.
For some reason I found that the Intelligence Suite struggled with this PDF, so I reverted to the R library "pdftools".
The workflow is attached, and the R code is:
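# Load pdftools; lib.loc points at my local R library, so change it to match your setup
library(pdftools, lib.loc="C:/Users/Philip/Anaconda3/envs/InsightsTool_vEnv/lib/R/library")
# Read the incoming list of file paths from input #1
df <- read.Alteryx("#1", mode="data.frame")
paths <- as.character(df$FullPath)
# Extract the text of the first file; pdf_text returns one string per page
txt <- pdf_text(paths[1])
# Write one row per page to output 1
df <- data.frame(txt=txt)
write.Alteryx(df, 1)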
So you will probably need to install pdftools first for this to work.
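Once the page text is out, the rows can be picked by their label text rather than by their position on the page, which should keep working even if the content shifts between files. A rough sketch of that idea, written in Python rather than R purely for illustration: `select_rows` is a hypothetical helper, and apart from "Total By Business" the labels and sample text are made up.

```python
def select_rows(page_text, labels):
    """Return the lines of one page that start with any of the wanted labels.

    Matching is case-insensitive and ignores leading whitespace, so a row
    is found by its text, not by where it sits on the page.
    """
    wanted = [label.lower() for label in labels]
    rows = []
    for line in page_text.splitlines():
        stripped = line.strip()
        if any(stripped.lower().startswith(w) for w in wanted):
            rows.append(stripped)
    return rows


# Hypothetical page-1 text; the real labels come from your own PDF
page_one = (
    "Daily Report\n"
    "  Alpha Unit     10   20\n"
    "  Beta Unit       5    6\n"
    "Total By Business   15   26\n"
)
for row in select_rows(page_one, ["Alpha Unit", "Total By Business"]):
    print(row)
```

Only the first page's string needs to be passed in, since just page 1 is wanted here.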
P
Thank you David,
This really helped me. But I have one issue: I am unable to install the R package since I do not have the dll file. I tried to figure out where I can get it, but had no luck. Could you help me with that?
Without it I cannot proceed. Since I did not have the dll, I replaced your PDF macro with PDF Input (BI Suite). It ran, but as you can see the text doesn't change when the file changes. Can you guide me on how to get the R plugin? I am attaching my package.
The PDF file that I attached is an older one; we will be getting the PDF on a regular basis.
Hey Philip,
Thanks for helping out. I don't see the R plugin on my system.
I'm not sure how I can get it.
Thanks
Mansi C.
Hi @Mansi3
The easiest way to install R is to install the Alteryx Predictive analytics package. You download it from the same place as the Alteryx installer.
In my workflow, the input is a static Text input object. We need to see what the output of your BI Suite PDF Input looks like and if it's the same, replace the Text Input with it.
But you're right, the best way would be to get the R package working.
I'm testing a solution I developed with the Image reader and running into the exact problem you flagged - "I have found the new tools to be less than optimal if your target text within the PDF shifts (even just a bit) and performance is not exactly speedy."
The position of the relevant content that I want to extract from my PDF input shifts for different months, which means if I use the annotation method it picks up the wrong data from the PDF. Is there a way to dynamically select content from PDFs without using annotations?