Reading text from PDF file

Idyllic_Data_Geek
8 - Asteroid

I need to scan a PDF document for a specific piece of information and then extract it to an Excel file. Is there any way of doing this in Alteryx without going the Python route? The PDF Input tool does not work for me, as my employer has not paid for the upgraded functions in Alteryx. Thanks in advance!

14 REPLIES
dougperez
12 - Quasar

You will have to use the PDF Input tool or Python. I don't know of any other method to do that; see the link below:
https://community.alteryx.com/t5/Alteryx-Designer-Discussions/How-To-Input-PDF-to-convert-to-Excel/t....

JosephSerpis
17 - Castor

You can use R instead of Python; however, that is still a coding approach.

https://community.alteryx.com/t5/Alteryx-Designer-Knowledge-Base/PDF-Parsing-in-Alteryx-using-R/ta-p...

Idyllic_Data_Geek
8 - Asteroid

@JosephSerpis can you please assist me with this R solution?

JosephSerpis
17 - Castor

What do you need help with?

Idyllic_Data_Geek
8 - Asteroid

I have a scanned letter, so I think it is an image in PDF format. I need to read two pieces of information from it, which will always be in the same place. The Python and R solutions are both giving me errors.

JosephSerpis
17 - Castor

Both the Python and R approaches are about tackling text in a PDF document rather than an image. The screenshot below shows the details of the R package used in the example I shared.

PDF_TEXT_R.JPG 
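
To make the text-versus-image distinction concrete, here is a rough Python sketch (standard library only) that guesses whether a PDF is a scan with no text layer. The names `looks_like_scanned_pdf` and `check_file` are made up for illustration, and the check is only a heuristic: PDFs that keep their objects in compressed streams can hide the markers it looks for, so treat the result as a hint rather than proof.

```python
import re
from pathlib import Path

# Markers searched for in the raw PDF bytes. This is a heuristic,
# not a real PDF parser: compressed object streams can hide both.
FONT_MARKER = re.compile(rb"/Font\b")
IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image\b")


def looks_like_scanned_pdf(pdf_bytes: bytes) -> bool:
    """Guess whether the PDF holds page images but no font resources."""
    has_font = FONT_MARKER.search(pdf_bytes) is not None
    has_image = IMAGE_MARKER.search(pdf_bytes) is not None
    return has_image and not has_font


def check_file(path: str) -> bool:
    """Convenience wrapper (hypothetical helper) that reads a file from disk."""
    return looks_like_scanned_pdf(Path(path).read_bytes())
```

If this returns True, plain text extraction will most likely come back empty, and the file needs OCR instead.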

Idyllic_Data_Geek
8 - Asteroid

So how can I extract the data out of an image? I can't even install the extra R packages that someone else had mentioned here on my machine.

markcurry
12 - Quasar

Hi @Idyllic_Data_Geek

If your PDF files haven't been OCR'ed, you can use this 'PDF Input (Text and Image)' tool created by @DiganP:

https://gallery.alteryx.com/#!app/PDF-Input--Text-and-Image-/5be5ec8d0462d71ffce6deaa

This tool uses 2 additional R packages (pdftools and tesseract). If you are blocked from installing R packages to your C:\Program Files\Alteryx\R-.... folder, you could try running the two attached workflows, which will install them to C:\Users\<username>\Documents\R\win-library\<version> instead.

Hopefully that helps.
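
For readers who can run Python, the general idea of rendering a page to an image and then running OCR on a fixed region can be sketched as below. This is not the code inside the gallery tool, which uses the R packages named above. The choice of `pdf2image` and `pytesseract` is an assumption (both are third-party and need Poppler and the Tesseract binary installed), and `fraction_box_to_pixels` and `ocr_region` are hypothetical helper names. The region is given as fractions of the page, which suits a letter where the wanted fields always sit in the same place.

```python
from typing import Tuple

# A region as (left, top, right, bottom), each a fraction of the page size.
Box = Tuple[float, float, float, float]


def fraction_box_to_pixels(width: int, height: int, box: Box) -> Tuple[int, int, int, int]:
    """Convert a fractional region into pixel coordinates for cropping."""
    left, top, right, bottom = box
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


def ocr_region(pdf_path: str, box: Box, dpi: int = 300) -> str:
    """Render page 1 of a scanned PDF and OCR only the given region."""
    # Imported here so the pure helper above works without these packages.
    from pdf2image import convert_from_path  # assumption: pdf2image + Poppler installed
    import pytesseract  # assumption: pytesseract + Tesseract binary installed

    page = convert_from_path(pdf_path, dpi=dpi, first_page=1, last_page=1)[0]
    region = page.crop(fraction_box_to_pixels(page.width, page.height, box))
    return pytesseract.image_to_string(region).strip()
```

Calling `ocr_region` once per field, each with its own box, would give the two values to write out to Excel. The box fractions have to be found by trial on a sample letter.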

Idyllic_Data_Geek
8 - Asteroid

@markcurry I get the error below, as I have the 2020 version installed on my company computer.

Idyllic_Data_Geek_0-1625833903699.png
