Extract Data from PDF

RDF25087
8 - Asteroid

Hi all -

 

Not sure if this can even be done. However....

 

I have a .pdf document with house builder data. Starting on page 55 there is a list of all house builders and their contact details. What I would like to do is extract each company name (in the blue bar) and return its full UK postcode.

 

Any help would be greatly appreciated.

 

RDF

3 REPLIES
DavidSkaife
13 - Pulsar

Hi @RDF25087 

 

There is a macro here - https://community.alteryx.com/t5/Public-Community-Gallery/PDF-Input/ta-p/887038 but there are a few pre-requisites before you can run it.

 

Another option, if you have the Intelligence Suite Licence - https://www.alteryx.com/products/intelligence-suite - is to use its PDF extraction capabilities.

RDF25087
8 - Asteroid

Hi @DavidSkaife 

 

Thank you for the quick reply. It doesn't look like we have the Intelligence Suite License - so I'll take a crack at the macro solution first.

 

RDF

32bit
8 - Asteroid

That macro looks like it uses R to parse the PDF. IMO R does not do a good job. I've worked with PDF files a number of times in production over the past couple of decades and the best free solution that I've found is the Xpdf command-line tools. These are no-frills exe files, and the results are better than R. There are multiple parsing options (check the --help switch). I've always had the best results using -layout and -table switches depending on how the document is formatted. Once converted, one would use regex and logic to parse and ensure no data is lost in conversion.

 

These are command-line parameters, so remember to use "quotes" if there is a space in the path or file name. Test in the shell before putting into Alteryx.

pdftotext.exe -layout file.pdf file.txt

or

pdftotext.exe -table file.pdf file.txt
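
If it helps to test the conversion outside Alteryx first, here is a rough Python sketch of the same two calls. It assumes pdftotext.exe is on the PATH, and the function names and file names are made up for illustration. Passing the arguments as a list means paths with spaces do not need manual quoting.

```python
import subprocess

# Only the two switches mentioned above; other pdftotext options are not covered.
MODES = {"layout": "-layout", "table": "-table"}


def build_command(pdf_path, txt_path, mode="layout", exe="pdftotext.exe"):
    """Build the argument list for one pdftotext call (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return [exe, MODES[mode], pdf_path, txt_path]


def convert(pdf_path, txt_path, mode="layout"):
    """Run pdftotext; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(build_command(pdf_path, txt_path, mode), check=True)


# Show the command that would be run for a file name containing a space.
print(build_command("house builders.pdf", "house builders.txt"))
```

Calling convert("house builders.pdf", "house builders.txt", mode="table") would then try the -table variant on the same file.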

 

To use this with Alteryx, you'd just set up the Run Command tool and read in the text results, either with the Run Command tool itself or, in special cases where the Alteryx tool mangles the results, with the Blob Input tool. I have this deployed on the gallery at work.
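
For the regex step on the converted text, a minimal sketch in Python is below. The pattern only approximates the general shape of a UK postcode and is not a full validator, and the sample text and function name are invented for illustration. Picking out the company names would depend on how the blue bar comes through in the text output, so that part is not attempted here.

```python
import re

# Approximate UK postcode shape: outward code (e.g. "LS1", "SW1A"),
# optional whitespace, then inward code (e.g. "4AP"). Not a strict validator.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})\b")


def find_postcodes(text):
    """Return postcode-like strings found in text, normalised to 'OUTWARD INWARD'."""
    return [f"{out} {inw}" for out, inw in POSTCODE_RE.findall(text.upper())]


# Invented sample lines standing in for the pdftotext output.
sample = """Example Homes Ltd
1 High Street, Leeds LS1 4AP
Another Builder plc
10 Some Road, London sw1a1aa
"""
print(find_postcodes(sample))
```

The same pattern should carry over to the RegEx tool in Alteryx, though the results are worth checking against the original PDF for anything lost in conversion.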
