
PDF Parsing in Alteryx using R

ShaanM
Alteryx Alumni (Retired)

There was a great post by @ChadM:

Can-Alteryx-Parse-A-Word-Doc-Or-PDF

The main issue with doctotext is that it does not support PDF with images.

In the post @carpoolboy talked about using R and provided some snippet code to do this.

I attached an example Alteryx macro I built, which has the R code embedded for you to use.

The great thing here is that you do not need any other executable and it works with PDF files containing images.

Please make sure you install two R packages first:

Rcpp

pdftools
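
From an R console, checking for and installing the two packages could look roughly like this. It is a minimal sketch: the helper name `missing_packages` and the CRAN mirror URL are illustrative assumptions, not part of the attached macro. Note that the package name on CRAN is the lowercase `pdftools`.

```r
# Return the packages from `pkgs` that are not installed in this R library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("Rcpp", "pdftools"))
print(to_install)

# Uncomment to install whatever is missing (the mirror URL is an assumption):
# install.packages(to_install, repos = "https://cloud.r-project.org")
```

The install line is left commented out so that nothing is downloaded until you have checked the list.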

Enjoy

ShaanM

Edited 28/11/18 - attached a Batch example

Attachments
Comments
ChadM
Alteryx Alumni (Retired)

@ShaanM, this is awesome!  

 

With the new Python SDK, I'd also like to see if we could use something like PDFMiner (Python library) to do this.  

Swift34
6 - Meteoroid

First I would like to thank you for your effort on this, most outstanding!

 

My first attempt at using this macro gave me an error because the underlying R library was outdated. I just used the RGui tool on my machine to update all of the R libraries to the latest versions in the CRAN repository, and then your macro worked!
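
For readers who prefer a command to the RGui menus, the same update can be sketched in R as below. The function name `update_all_packages` and the mirror URL are assumptions made for illustration.

```r
# Update every installed package that has a newer version on CRAN.
# `cran` is an assumed mirror; any CRAN mirror should work.
update_all_packages <- function(cran = "https://cloud.r-project.org") {
  update.packages(ask = FALSE, checkBuilt = TRUE, repos = cran)
}

# Run once from an R console that has write access to the R library:
# update_all_packages()
```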

 

Again, thank you.

ShaanM
Alteryx Alumni (Retired)

@Swift34 Glad you found it useful!

 

If you find you have multiple files, you could easily turn the macro into a batch macro that loops through each file one at a time.

pdfparseexamplebatch.jpg

 

Simply edit the macro and place a Control Parameter anywhere on the canvas, not connected to anything. Don't forget to save the new macro.

 

Then on the original workflow feed in the stream to the two inputs:

Ollie_Power
6 - Meteoroid

Thanks @ShaanM,

 

This was also a great help to me. To help others, I wrote up a short blog post using an example PDF:

Parsing PDFs using Alteryx and a little R

Dynamomo
11 - Bolide

Thanks @ShaanM and @Ollie_Power

This was much easier than the old methodology with the doctotext executable.  Just minutes to type in the code and done!

PhilH
Alteryx Alumni (Retired)

@ShaanM @ChadM 

How about OCR?  If I scan a PDF, and the image is kind of rough, is there a plug-in or something that can convert it to readable text?

Dynamomo
11 - Bolide

@PhilH - if the user can find a command line OCR executable then they can run it from within Alteryx in the Run Command tool and convert the images to text.

ShaanM
Alteryx Alumni (Retired)

@PhilH another avenue to explore is R. R has many other packages that could be leveraged. I stumbled across one called Tesseract which might get you close.

 

https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html
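
As a rough illustration of that idea, a scanned PDF can be rendered to images with `pdftools::pdf_convert()` and then passed to `tesseract::ocr()`. This is a sketch under assumptions: the function name `ocr_pdf`, the 300 dpi setting, and the English language model are choices made for the example, and it has not been tested inside the macro.

```r
# OCR a scanned PDF: render each page to a PNG, then run Tesseract on it.
# Requires the pdftools and tesseract packages to be installed.
ocr_pdf <- function(pdf_path, dpi = 300, language = "eng") {
  stopifnot(is.character(pdf_path), length(pdf_path) == 1, file.exists(pdf_path))

  # pdf_convert() writes one image per page into the working directory
  # and returns the file names it created.
  images <- pdftools::pdf_convert(pdf_path, format = "png", dpi = dpi,
                                  verbose = FALSE)
  on.exit(unlink(images), add = TRUE)  # remove the temporary page images

  engine <- tesseract::tesseract(language)
  text <- tesseract::ocr(images, engine = engine)

  data.frame(page = seq_along(text), text = unname(text),
             stringsAsFactors = FALSE)
}

# Example call (the file name is hypothetical):
# scanned <- ocr_pdf("scanned_letter.pdf")
```

OCR quality depends heavily on the scan, so rough images may still need clean-up before this gives readable text.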

 

TS-User
7 - Meteor

Hi Shaan

 

Thanks for sharing the Parse Workflow. However, I am having trouble importing the PDF file.

Could you please share a screenshot showing which options you chose when connecting the file (e.g. choosing csv ends in an error message)?

 

Thanks

Kristina

ShaanM
Alteryx Alumni (Retired)

Hi @TS-User

 

In the very first tool, the file specification needs to be .pdf.

 

This tool points to a directory where your pdf file resides.

 

Kind regards

 

Shaan

TS-User
7 - Meteor

Unfortunately there is no pdf extension in the input tool. How can I solve this?

TIA

ShaanM
Alteryx Alumni (Retired)

Hi @TS-User

 

In my workflow, which is at the top of this page, the idea is to use a Directory input.

 

I also attached in an earlier post a second workflow example to loop through multiple files. Each one uses a directory input not a standard input tool. There you should be specifying .pdf.

 

See this screenshot:

PDF.jpg

 

 

hope this helps

 

Shaan

DmtCoj
7 - Meteor

Hello @ShaanM,

 

Thank you for the time you spent searching for new ways of doing things 🙂

I'm wondering if you have successfully used the "Magick" package before letting tesseract do the job.

Maybe you can share some nasty example. 
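
For reference, the tesseract package vignette describes cleaning an image with magick before OCR. A sketch along those lines follows; the function name `preprocess_and_ocr` and the specific resize and fuzz values are assumptions that would need tuning for a real scan.

```r
# Clean up a rough image with magick, then hand it to Tesseract.
# Requires the magick and tesseract packages to be installed.
preprocess_and_ocr <- function(image_path) {
  stopifnot(is.character(image_path), length(image_path) == 1,
            file.exists(image_path))

  img <- magick::image_read(image_path)
  img <- magick::image_resize(img, "2000x")              # upscale to 2000 px wide
  img <- magick::image_convert(img, type = "Grayscale")  # drop colour information
  img <- magick::image_trim(img, fuzz = 40)              # crop near-uniform borders

  # With no path given, image_write() returns the PNG as a raw vector,
  # which ocr() accepts directly.
  png_bytes <- magick::image_write(img, format = "png", density = "300x300")
  tesseract::ocr(png_bytes)
}

# Example call (the file name is hypothetical):
# text <- preprocess_and_ocr("rough_scan.png")
```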

ShaanM
Alteryx Alumni (Retired)

@DmtCoj

 

Glad you found this insightful

 

In the coming weeks I am hoping to have time to build something out, and then produce a Community post.

 

I will keep you updated once I have a working example.

 

Shaan

JaskiratChohan
5 - Atom



@ShaanM wrote:

There was a great post by @ChadM:

 

Can-Alteryx-Parse-A-Word-Doc-Or-PDF

 

The main issue with doctotext is that it does not support PDF with images.

 

In the post @carpoolboy talked about using R and provided some snippet code to do this.

 

I attached a v11.3  Alteryx example macro i built which has the R code embedded for you to use.

 

The great thing here is that you do not need any other executable and it works with PDF files containing images.

 

Please make sure you install two R packages first:

 

Rcpp

Pdftools

 

Enjoy  

ShaanM


I was not able to execute it as a batch. Can you please help me with how to place and define a control parameter?

ShaanM
Alteryx Alumni (Retired)

hi @JaskiratChohan

 

Edit the macro and place a control parameter anywhere on the canvas

 

BatchPDF.jpg

 

 

 

Then go to View > Interface Designer and specify that the output schema will change.

 

BatchPDF2.jpg

barry
5 - Atom

Any way for this to work on PDFs with multiple pages?  I'm only able to pull page one

ShaanM
Alteryx Alumni (Retired)

@barry my original post of the macro should work with multiple pages automatically.
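
To show why multiple pages come through, here is a minimal sketch of the pdftools call involved: `pdftools::pdf_text()` returns a character vector with one element per page. The function name `pdf_to_pages` and the output column names are assumptions for illustration, and the code embedded in the macro may differ.

```r
# Read the text layer of a PDF and return one row per page.
# Requires the pdftools package to be installed.
pdf_to_pages <- function(pdf_path) {
  stopifnot(is.character(pdf_path), length(pdf_path) == 1, file.exists(pdf_path))

  pages <- pdftools::pdf_text(pdf_path)  # one string per page

  data.frame(
    file = rep(pdf_path, length(pages)),
    page = seq_along(pages),
    text = pages,
    stringsAsFactors = FALSE
  )
}

# Example call (the file name is hypothetical):
# result <- pdf_to_pages("C:/docs/report.pdf")
# nrow(result) should then equal the number of pages in the PDF.
```

Inside an Alteryx R tool, the path would typically arrive through `read.Alteryx()` and the result would leave through `write.Alteryx()`. If only page one appears, it is worth checking whether later rows are being dropped downstream.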

 

Please drop me a direct message if it does not work. 

 

Kind regards

 

Shaan

RishiK
Alteryx

@ShaanM fantastic way to do this!

kat
12 - Quasar

@ShaanM we've used your macro as a base for so many things. Thanks!

 

Did you ever get a chance to look into reading images in PDFs?

kelvin_law1
9 - Comet

Hi @ShaanM,

 

Great PDF Parse tool, thanks!!

 

However, I couldn't make reading multiple PDF files work by following your batch macro instructions.

In my workflow, I just want to scrape 2 PDF files by supplying the file names with their paths in the Text Input tool. So how should I configure the batch macro? Should I set the GroupBy fields in the batch macro's configuration?

 

If I don't input the GroupBy fields, I've got an Error: PDFParserBatch (16): Record #1: Tool #2: Error in file(con, "rb") : invalid 'description' argument

NoGroupBy.png
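
In R, "invalid 'description' argument" is generally raised when the value passed to a file-opening call is not a single usable path, for example a missing or empty value. Whether that is the cause in this workflow is an assumption, but a guard like the hypothetical `valid_pdf_path` below makes the condition easy to test before parsing.

```r
# TRUE only when `path` is exactly one non-empty string naming an existing file.
valid_pdf_path <- function(path) {
  is.character(path) &&
    length(path) == 1 &&
    !is.na(path) &&
    nzchar(path) &&
    file.exists(path)
}

# Values like these would be rejected instead of reaching the PDF reader:
print(valid_pdf_path(NA_character_))  # FALSE
print(valid_pdf_path(character(0)))   # FALSE
print(valid_pdf_path(""))             # FALSE
```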

 

 

If I input the GroupBy fields, the macro can only parse 1 file.

WithGroupBy.png

 

Please give me some advice, thanks very much!!

 

BenMoss
ACE Emeritus

@kelvin_law1, here is another tool you could try which does a similar thing, and by default is set up as a batch macro to process multiple files within a directory.

 

https://gallery.alteryx.com/#!app/PDF-Input/5b685aff0462d710907f7a3b

 

Ben

kelvin_law1
9 - Comet

Hi @BenMoss,

 

Thanks for your prompt reply.  Would you mind telling me the link of the tool that you mentioned in your post?

ShaanM
Alteryx Alumni (Retired)

@kelvin_law1

 

I have edited my original post and added in a batch example

 

hope this helps you

BenMoss
ACE Emeritus

@kelvin_law1 that was stupid of me! Have amended the post!

 

Ben

JoeS
Alteryx

@kelvin_law1

 

I think the group by option is the one you need. The issue you have is that there isn't anything downstream of the batch macro.

 

If you add a browse tool or something else downstream of the batch macro I think you will find it will then batch.

 

It's a "feature" of using the browse anywhere rather than having tools downstream. Alteryx tries to be efficient and only process what's needed, AKA one batch for you to see in your browse anywhere sample.

asmit_kumar_pwc
7 - Meteor

It works perfectly. Much appreciated.. Thank You so much. 

carlosmartinezm
7 - Meteor

Thank you for your effort !

Still waiting for some admin rights to be able to run Alteryx as an Admin to install the packages.

But this is exactly what I was looking for.

 

Kiitos!

LFLee
8 - Asteroid

Hello @ShaanM

 

I have installed the R package and the pdftools package, but when I use the workflow I get the error below. Is there anything I'm missing? I'd appreciate any solutions to this.

 

2019-09-21 23_29_42-Alteryx Designer x64 - PDFBatchExample.yxmd_.png

 

ShaanM
Alteryx Alumni (Retired)

@LFLee 

 

If that error appears, it might be that you are installing the R packages as a non-admin user.

Ensure you are the admin, then try installing the two R packages again:

Rcpp

pdftools

LFLee
8 - Asteroid

@ShaanM

2019-09-22 18_53_46-Network access.png

 

Thank you for your reply.

 

I've run RGui as an admin and managed to install both packages, but I still received the error when running the workflow (I've attached two screenshots of the installation). Any tips or solutions that could help resolve this error would be much appreciated.

 

2019-09-22 18_50_43-RGui (64-bit).png

ShaanM
Alteryx Alumni (Retired)

@LFLee It might be worth emailing our client service team to get a ticket opened.

 

Support@alteryx.com 

 

They should be able to diagnose faster what is going wrong.

 

One other thing to try: once the packages are installed, open Designer by right-clicking it and choosing to run it as administrator.

 

Kind regards

JoeS
Alteryx

Hi @LFLee 

 

As above, please make sure to run the R tool within Alteryx Designer rather than RGui.

What version of Alteryx Designer do you have installed? Is it the admin version?

LFLee
8 - Asteroid

Hi @JoeS

 

The version of Alteryx Designer I have installed is 2018.4, which ships with R 3.4.4. I have realised that pdftools only works with R 3.5.1 and above. Do you know if there is a version of pdftools that works with R 3.4.4?
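
A quick way to confirm which R version and which pdftools version a given R session is using is sketched below. The helper name `r_is_at_least` is an assumption, and the 3.5.1 threshold is simply the figure quoted in this comment, not something verified here.

```r
# TRUE when the running R version is at least `minimum` (e.g. "3.5.1").
r_is_at_least <- function(minimum) {
  getRversion() >= minimum
}

print(R.version.string)        # the R version this session is running
print(r_is_at_least("3.5.1"))  # threshold taken from the comment above

# Report the installed pdftools version, if the package is present.
if (requireNamespace("pdftools", quietly = TRUE)) {
  print(packageVersion("pdftools"))
}
```

Running this inside the Alteryx R tool rather than a separate RGui shows the version Designer actually uses.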

 

Thanks in advance.

JoeS
Alteryx

Hi @LFLee ,

 

I am not sure there is a version that works with the older version of R (unfortunately navigating CRAN has never been a strength of mine).

 

My recommendation would be to update your version of Alteryx, as it's currently 9 months old 🙂

LFLee
8 - Asteroid

Hi @JoeS

 

Thank you for your advice. I have managed to update Alteryx and it works. However, when I use a batch macro to parse the PDF and select a PDF file, I get a "file type not recognised" error, even though pdftools has been installed.

Any suggestions on how to resolve this?

JoeS
Alteryx

That's odd, are you able to send a screenshot of the actual error?

LFLee
8 - Asteroid

Hi @JoeS

 

This is the error.

 

2019-09-25 00_09_22-Alteryx Designer x64 - ShaanPDFParserMultiFileBatch.yxmc.png

JoeS
Alteryx

Ah, it looks like you are modifying the macro itself. You can use it in a different workflow, where you feed in the two inputs it requires. It shouldn't need any internal modification to work.

Samit89
5 - Atom

Dear ShaanM and JoeS,

 

I am trying to use your code snippet, but I am facing a small issue with it.

I am using a PDF with an image as input, and after running the code I get blank output.

Can you please let us know the reason for this?

ShaanM
Alteryx Alumni (Retired)

@Samit89 

 

Sorry, I have been on leave.

These packages and the code used are mainly meant to handle text within PDFs rather than images.
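
One way to spot this situation is to check which pages come back with no text at all, which usually suggests the page is a scanned image without a text layer. The sketch below works on any character vector of page texts, such as the output of `pdftools::pdf_text()`; the function name `pages_without_text` is an assumption.

```r
# Return the indices of pages whose extracted text is empty or only whitespace.
pages_without_text <- function(pages) {
  which(!nzchar(trimws(pages)))
}

# Illustrative input: page 1 has text, pages 2 and 3 are effectively blank.
example_pages <- c("Invoice 42", "  \n ", "")
print(pages_without_text(example_pages))  # 2 3
```

Pages flagged this way are candidates for OCR rather than plain text extraction.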

 

Kind regards

 

Shaan

Ekta
8 - Asteroid

@ShaanM 

Hi,

Thank you for this amazing workflow; it's giving me the expected output.

I would like your help with the R script: how can I split the data into different cells in Excel? Currently all the data ends up in a single cell (1,1).

JoeS
Alteryx

Hi @Ekta 

 

Fortunately/Unfortunately - that's part of the fun. The R script is only able to bulk-read the PDF into a single cell.

It's at this point, though, that you can leverage the Alteryx tools to get the data into the right format.

 

Without knowing your PDF format/structure I'd say you'll almost certainly want to be using Text To Columns to split to rows based upon "\n" a new line character. Then you'll need to move into parsing out the columns in the table.
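
To make the idea concrete, the snippet below mimics in plain R what splitting to rows on a new line and then splitting columns would produce. It is only an illustration of the logic: the sample text, the helper name `split_lines`, and the "two or more spaces" column rule are assumptions, and in practice the Alteryx parse tools would do this work.

```r
# Split one block of page text into non-empty lines.
split_lines <- function(text) {
  lines <- strsplit(text, "\r?\n")[[1]]
  lines[nzchar(trimws(lines))]
}

# Made-up page text standing in for the single cell the R script returns.
page <- "Name   Qty\nWidget   3\n\nGadget   5\n"

rows <- split_lines(page)
print(rows)  # "Name   Qty" "Widget   3" "Gadget   5"

# Split each row into columns wherever two or more spaces appear.
cols <- strsplit(trimws(rows), " {2,}")
print(cols[[2]])  # "Widget" "3"
```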

 

One other thing I want to mention is that, as part of the Intelligence Suite we released last year, there is a fantastic way to parse through PDF files and read them into a tabular format.

More details can be found here: https://www.alteryx.com/products/alteryx-platform/intelligence-suite 

ShaanM
Alteryx Alumni (Retired)

@Ekta 

 

As @JoeS mentioned, the next best step would be to use the preparation and parse tools within Alteryx Designer to get to your desired end result.

If there is anything in your data that could be used to split it into separate columns, that could be used as a delimiter within Text To Columns.

https://help.alteryx.com/current/designer/text-columns-tool

White Space     White Space Character
Tab             \t
New Line        \n
Space           \s
Space or Tab    \s\t

But you can also use anything else that appears in the data, e.g. punctuation and pipes (|).

 

I would tend not to do this in the R script, as it does not lend itself to being reusable and transparent for the future.

As Joe mentioned, another route is the Intelligence Suite add-on. If you speak to your Alteryx Account Manager, I am sure they can organize a trial for you.

Kind regards

Shaan

Idyllic_Data_Geek
8 - Asteroid

Idyllic_Data_Geek_0-1625689121549.png

Why am I getting an error when trying to use the Text Input to connect to a PDF file?

 

Idyllic_Data_Geek
8 - Asteroid

I have scanned images of documents coming in as PDFs. I need 2 pieces of information from each document, and I would like to do this for letters in bulk. Please help!

JoeS
Alteryx

What are you doing in order to get that error? It looks like you are trying to open the file using File > Open in the top left?

 

We also have the Alteryx Intelligence Suite, which has a much richer feature set for text mining images and/or PDFs, that you may want to explore.

Idyllic_Data_Geek
8 - Asteroid

@JoeS That is correct. I'm clicking the highlighted button to connect to the file.

 

Idyllic_Data_Geek_0-1625755891924.png

 

JoeS
Alteryx

Ah OK, what goes in there needs to be the path to the PDF, not the PDF itself.

Try that and let me know.