Alteryx Designer Desktop Discussions

RDF25087 · ‎06-15-2022

Hi all -

Not sure if this can even be done. However....

I have a .pdf document with house builder data. Starting on page 55 there is a list of all house builders and their contact details. What I would like to do is extract each company name (in the blue bar) and return its full UK postcode.

Any help would be greatly appreciated.

RDF

davidskaife · ‎06-15-2022

Hi @RDF25087

There is a macro here - https://community.alteryx.com/t5/Public-Community-Gallery/PDF-Input/ta-p/887038 but there are a few pre-requisites before you can run it.

Another option, if you have the Intelligence Suite licence (https://www.alteryx.com/products/intelligence-suite), is to use its PDF extraction capabilities.

RDF25087 · ‎06-15-2022

Hi @davidskaife

Thank you for the quick reply. It doesn't look like we have the Intelligence Suite licence, so I'll take a crack at the macro solution first.

RDF

32bit · ‎06-15-2022

That macro looks like it uses R to parse the PDF. IMO R does not do a good job. I've worked with PDF files a number of times in production over the past couple of decades and the best free solution that I've found is the Xpdf command-line tools. These are no-frills exe files, and the results are better than R. There are multiple parsing options (check the --help switch). I've always had the best results using -layout and -table switches depending on how the document is formatted. Once converted, one would use regex and logic to parse and ensure no data is lost in conversion.

These are command-line parameters, so remember to use "quotes" if there is a space in the path or file name. Test in the shell before putting into Alteryx.

pdftotext.exe -layout file.pdf file.txt

or

pdftotext.exe -table file.pdf file.txt
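If you would rather script the call than type it, here is a minimal Python sketch of the same command. It is only an illustration: the function names are made up, it assumes pdftotext.exe is on the PATH, and it assumes your build accepts -f (first page) and -enc UTF-8. Passing the arguments as a list avoids the quoting problem with spaces in paths.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, mode="layout",
                            first_page=None, exe="pdftotext.exe"):
    """Build the argument list for pdftotext (hypothetical helper)."""
    if mode not in ("layout", "table"):
        raise ValueError("mode must be 'layout' or 'table'")
    cmd = [exe, f"-{mode}"]
    if first_page is not None:
        # -f sets the first page to convert (e.g. 55 for the builder list)
        cmd += ["-f", str(first_page)]
    # Assumption: this build of pdftotext supports UTF-8 output via -enc
    cmd += ["-enc", "UTF-8"]
    # Paths are separate list items, so spaces need no extra quoting
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def convert_pdf(pdf_path, txt_path, **options):
    """Run pdftotext and return the converted text."""
    cmd = build_pdftotext_command(pdf_path, txt_path, **options)
    subprocess.run(cmd, check=True)  # raises if pdftotext reports an error
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

For example, convert_pdf("house builders.pdf", "builders.txt", mode="layout", first_page=55) would convert from page 55 onward, assuming a file with that name exists.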

To use with Alteryx, you'd just set up the Run Command tool and read in the text results using either the Run Command tool itself, or the Blob Input tool in special cases if the results are mangled by the Alteryx tool. I have this deployed on the Gallery at work.
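As a rough idea of the regex step for the original question, here is a Python sketch that pulls a company name and UK postcode out of the converted text. It makes assumptions about a layout I have not seen: that each builder is a block of lines separated by a blank line, and that the company name is the first line of the block. The postcode pattern is a simplified one, not a full validation, and the sample data is invented.

```python
import re

# Simplified UK postcode pattern: outward code, optional space, inward code.
# It does not validate against the real list of postcode areas.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}[0-9][A-Z0-9]?)[ \t]*([0-9][A-Z]{2})\b")


def extract_builders(text):
    """Return (company_name, postcode) pairs from pdftotext output.

    Assumption: entries are separated by blank lines and the company
    name is the first non-empty line of each entry.
    """
    records = []
    for block in re.split(r"\n\s*\n", text):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        match = POSTCODE_RE.search(block.upper())
        if match:
            # Normalise to the usual "outward inward" spacing
            records.append((lines[0], f"{match.group(1)} {match.group(2)}"))
    return records


# Invented sample text, standing in for the converted file
sample = (
    "Acme Homes Ltd\n1 High Street, Leeds\nLS1 4AP\n\n"
    "Example Builders\nUnit 2, Trading Estate\nBristol bs1 5tr\n\n"
    "No Postcode Co\nSomewhere\n"
)
for name, postcode in extract_builders(sample):
    print(name, "|", postcode)
```

Entries with no postcode are skipped here, so it is worth comparing the record count against the PDF to make sure nothing was lost in conversion.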
