
Alteryx Designer Knowledge Base

Definitive answers from Designer experts.

PDF Parsing in Alteryx using R

Alteryx

There was a great post by @ChadM:

 

Can-Alteryx-Parse-A-Word-Doc-Or-PDF

 

The main issue with doctotext is that it does not support PDFs that contain images.

 

In that post, @carpoolboy talked about using R and provided a code snippet to do this.

 

I have attached an example Alteryx macro I built, which has the R code embedded for you to use.

 

The great thing here is that you do not need any other executable and it works with PDF files containing images.

 

Please make sure you install two R packages first:

 

Rcpp

pdftools
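
A minimal sketch of checking for these two packages from an R console before running the macro. CRAN package names are case-sensitive, and the second package is published as pdftools (lower case). The helper name missing_packages is made up for illustration, and the check is assumed to run against the same R installation that Alteryx uses.

```r
# Return the subset of `pkgs` that cannot be loaded in this R installation.
missing_packages <- function(pkgs) {
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_installed]
}

needed <- c("Rcpp", "pdftools")
to_install <- missing_packages(needed)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment to install from CRAN (requires network access):
  # install.packages(to_install)
}
```

If the macro still fails with the packages present, updating them (for example with update.packages() from the same R installation) may be worth trying.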

 

Enjoy  

ShaanM

 

Edited 28/11/18 - attached a Batch example

Comments
Alteryx Alumni (Retired)

@ShaanM, this is awesome!  

 

With the new Python SDK, I'd also like to see if we could use something like PDFMiner (Python library) to do this.  

Meteoroid

First I would like to thank you for your effort on this, most outstanding!

 

My first attempt at using this macro gave me an error because the underlying R library was outdated. I used the Rgui tool on my machine to update all of the R libraries to the latest versions in the CRAN repository, and then your macro worked!

 

Again, thank you.

Alteryx

@Swift34 Glad you found it useful!

 

If you have multiple files, you could easily turn the macro into a batch macro that loops through the files one at a time.

pdfparseexamplebatch.jpg

 

Simply edit the macro and place a Control Parameter anywhere on the canvas, not connected to anything. Don't forget to save the new macro.
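
For readers who think in code, the batch behaviour amounts to running the same parsing step once per file and stacking the results. Below is a rough R sketch of that idea only; it is not the macro's embedded code. The names parse_each_file and fake_reader are hypothetical, and the reader is a stand-in so the example runs without a PDF.

```r
# Apply `read_one` to each file path in turn and stack the results,
# tagging every row with the file it came from.
parse_each_file <- function(paths, read_one) {
  results <- lapply(paths, function(path) {
    page_text <- read_one(path)
    data.frame(
      file = rep(path, length(page_text)),
      text = page_text,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, results)
}

# Stand-in reader so the sketch runs without a PDF; in practice this
# could be something like function(p) pdftools::pdf_text(p).
fake_reader <- function(path) c(paste("first page of", path), "second page")

parsed <- parse_each_file(c("a.pdf", "b.pdf"), fake_reader)
print(parsed)
```

Passing the reader in as an argument keeps the looping logic separate from the PDF library, which makes it easy to swap in a different parser later.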

 

Then, in the original workflow, feed the stream into both of the macro's inputs:

Meteoroid

Thanks @ShaanM,

 

This was also a great help to me. To help others, I wrote up a short blog post using an example PDF:

Parsing PDFs using Alteryx and a little R

Alteryx Certified Partner

Thanks @ShaanM and @Ollie_Power

This was much easier than the old method with the doctotext executable. It took just minutes to type in the code, and it was done!

Alteryx

@ShaanM @ChadM 

How about OCR?  If I scan a PDF, and the image is kind of rough, is there a plug-in or something that can convert it to readable text?

Alteryx Certified Partner

@PhilH - if the user can find a command-line OCR executable, they can run it from within Alteryx using the Run Command tool and convert the images to text.

Alteryx

@PhilH another avenue to explore is R, which has many other packages that could be leveraged. I stumbled across one called tesseract, which might get you close.

 

https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html
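
A hedged sketch of how the tesseract package could be combined with pdftools for a scanned PDF: render each page to an image, then run OCR on each image. The function name ocr_pdf, the 300 dpi setting, and the example path are assumptions for illustration. It needs both packages installed plus the English training data, and it has not been tested inside the Alteryx R tool.

```r
# OCR every page of a scanned PDF and return one string per page.
# Requires the pdftools and tesseract packages (and English training data).
ocr_pdf <- function(pdf_path, dpi = 300) {
  stopifnot(is.character(pdf_path), length(pdf_path) == 1,
            file.exists(pdf_path))
  # Render each page to a PNG file; pdf_convert returns the image paths.
  page_images <- pdftools::pdf_convert(pdf_path, format = "png", dpi = dpi)
  # Remove the temporary page images when the function exits.
  on.exit(unlink(page_images), add = TRUE)
  engine <- tesseract::tesseract("eng")
  vapply(page_images,
         function(img) tesseract::ocr(img, engine = engine),
         character(1), USE.NAMES = FALSE)
}

# Usage (the path is hypothetical):
# page_text <- ocr_pdf("C:/scans/invoice.pdf")
```

OCR quality depends heavily on the scan, so rough images may need cleaning up before this step gives readable text.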

 

Meteor

Hi Shaan

 

Thanks for sharing the Parse Workflow. However, I am having trouble importing the PDF file.

Could you please share a screenshot showing which options you chose when connecting the file? (For example, choosing csv ends in an error message.)

 

Thanks

Kristina

Alteryx

Hi @TS-User

 

In the very first tool, the file specification needs to be .pdf

 

This tool points to the directory where your PDF file resides.

 

Kind regards

 

Shaan

Meteor

Unfortunately, there is no pdf extension in the Input tool. How can I solve this?

TIA

Alteryx

Hi @TS-User

 

The workflow attached at the top of this page uses a Directory tool as its input.

 

In an earlier post I also attached a second example workflow that loops through multiple files. Both use a Directory tool, not a standard Input Data tool. That is where you should specify .pdf.
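
For comparison, the same "list every PDF in a folder" step can be sketched in plain R. This is only an illustration of what the directory listing with a .pdf file specification achieves, not part of the attached workflow; list_pdfs and the demo folder are made-up names.

```r
# List PDF files in a directory, similar in spirit to a directory listing
# with the file specification restricted to PDF files.
list_pdfs <- function(dir) {
  list.files(dir, pattern = "\\.pdf$", full.names = TRUE, ignore.case = TRUE)
}

# Demonstration in a throwaway directory.
demo_dir <- file.path(tempdir(), "pdf_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("report.pdf", "SCAN.PDF", "notes.csv")))

found <- list_pdfs(demo_dir)
print(basename(found))
```

The full paths returned here are the kind of value that gets passed, one at a time, to the PDF parsing step.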

 

See this screenshot:

PDF.jpg

 

 

Hope this helps

 

Shaan

Meteor

Hello @ShaanM,

 

Thank you for the time you spend searching for new ways of doing things :)

I'm wondering whether you have successfully used the "magick" package to prepare images before letting tesseract do the job.

Maybe you could share an example that works on a messy image.
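
One possible shape for this, following the preprocessing pattern shown in the tesseract package's own introduction: resize, convert to grayscale, trim the borders, then hand the result to the OCR engine. The function name ocr_with_cleanup and the default width and fuzz values are assumptions and would need tuning for a given scan. It requires the magick and tesseract packages.

```r
# Clean up a rough page image before OCR. Requires magick and tesseract.
# The resize width and fuzz value are starting points, not tuned settings.
ocr_with_cleanup <- function(image_path, width = "2000x", fuzz = 40) {
  stopifnot(is.character(image_path), length(image_path) == 1,
            file.exists(image_path))
  img <- magick::image_read(image_path)
  img <- magick::image_resize(img, width)                # enlarge small scans
  img <- magick::image_convert(img, type = "Grayscale")  # drop colour
  img <- magick::image_trim(img, fuzz = fuzz)            # crop plain borders
  # image_write without a path returns the PNG as a raw vector.
  png_bytes <- magick::image_write(img, format = "png", density = "300x300")
  tesseract::ocr(png_bytes)
}

# Usage (the path is hypothetical):
# text <- ocr_with_cleanup("C:/scans/page_1.png")
```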

Alteryx

@DmtCoj

 

Glad you found this insightful.

 

In the coming weeks I am hoping to have time to build something out and then produce a Community post.

 

I will keep you updated once I have a working example.

 

Shaan



@ShaanM wrote:


I attached a v11.3 Alteryx example macro I built which has the R code embedded for you to use.



I was not able to run it as a batch. Can you please help me with how to add and define a control parameter?

Alteryx

Hi @JaskiratChohan

 

Edit the macro and place a Control Parameter anywhere on the canvas.

 

BatchPDF.jpg

 

 

 

Then go to View > Interface Designer and specify that the output schema will change.

 

BatchPDF2.jpg

Atom

Is there any way for this to work on PDFs with multiple pages? I'm only able to pull page one.

Alteryx

@barry the macro in my original post should handle multiple pages automatically.
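
As background, pdftools::pdf_text() returns a character vector with one element per page, so every page is available once the file has been read. The sketch below is an illustration only, not the macro's embedded code: it shows one way to turn that per-page vector into one row per line while keeping the page number. pages_to_lines is a hypothetical helper, and the sample text stands in for real pdf_text() output.

```r
# Turn a vector of page texts (one element per page, which is the shape
# pdftools::pdf_text() returns) into one row per line, keeping page numbers.
pages_to_lines <- function(pages) {
  per_page <- strsplit(pages, "\r?\n")
  data.frame(
    page = rep(seq_along(per_page), lengths(per_page)),
    text = as.character(unlist(per_page)),
    stringsAsFactors = FALSE
  )
}

# Stand-in for pdf_text() output so the sketch runs without a PDF.
pages <- c("Invoice 42\nTotal: 10", "Page two text")
line_rows <- pages_to_lines(pages)
print(line_rows)
```

If only the first page shows up downstream, it is worth checking whether a later tool is keeping just the first record.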

 

Please drop me a direct message if it does not work. 

 

Kind regards

 

Shaan

Alteryx

@ShaanM fantastic way to do this!

Quasar

@ShaanM we've used your macro as a base for so many things. Thanks!

 

Did you ever get a chance to look into reading images in PDFs?

Alteryx Certified Partner

Hi @ShaanM,

 

Great PDF Parse tool, thanks!!

 

However, I couldn't get reading multiple PDF files to work by following your batch macro instructions.

In my workflow, I just want to scrape 2 PDF files by entering the file names with their paths in a Text Input tool. How should I configure the batch macro? Should I set the GroupBy fields in the batch macro's configuration?

 

If I don't set the GroupBy fields, I get an error: PDFParserBatch (16): Record #1: Tool #2: Error in file(con, "rb") : invalid 'description' argument

NoGroupBy.png
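
A note on that error message: in base R, file() reports "invalid 'description' argument" when it is given anything other than a single, non-missing character string. One plausible reading, not confirmed in this thread, is that without GroupBy both file paths reach the R code at the same time. The sketch below reproduces the message with two paths and shows a small guard; check_single_path is a hypothetical helper.

```r
# Check that exactly one usable path is being handed to a file reader.
# file() reports "invalid 'description' argument" when it receives
# anything other than a single, non-missing character string.
check_single_path <- function(path) {
  if (!is.character(path) || length(path) != 1 || is.na(path)) {
    stop("expected one file path, got ", length(path), " value(s)")
  }
  invisible(path)
}

# Two paths at once trigger the error in base R's file():
two_paths <- c("a.pdf", "b.pdf")
result <- tryCatch(file(two_paths, "rb"),
                   error = function(e) conditionMessage(e))
print(result)
```

Calling a guard like this at the top of the R code gives a clearer message about how many paths arrived.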

 

 

If I set the GroupBy fields, the macro only parses 1 file.

WithGroupBy.png

 

Please give me some advice, thanks very much!!

 

Alteryx Certified Partner

@kelvinlaw, here is another tool you could try. It does a similar thing and is set up by default as a batch macro to process multiple files within a directory.

 

https://gallery.alteryx.com/#!app/PDF-Input/5b685aff0462d710907f7a3b

 

Ben

Alteryx Certified Partner

Hi @BenMoss,

 

Thanks for your prompt reply. Would you mind sharing the link to the tool that you mentioned in your post?

Alteryx

@kelvinlaw

 

I have edited my original post and added a batch example.

 

Hope this helps you.

Alteryx Certified Partner

@kelvinlaw that was stupid of me! I have amended the post!

 

Ben

Alteryx

@kelvinlaw

 

I think the GroupBy option is the one you need. The issue you have is that there isn't anything downstream of the batch macro.

 

If you add a Browse tool or something else downstream of the batch macro, I think you will find that it then processes every batch.

 

It's a "feature" of using Browse Anywhere rather than having tools downstream. Alteryx tries to be efficient and only processes what's needed, which here means the one batch shown in your Browse Anywhere sample.

Meteoroid

It works perfectly. Much appreciated. Thank you so much.