Hi all -
Not sure if this can even be done. However....
I have a .pdf document with house builder data. From page 55 is a list of all house builders and their contact details. What I would like to do is extract their company name (in the blue bar) and return their UK full postcode.
Any help would be greatly appreciated.
RDF
Hi @RDF25087
There is a macro here - https://community.alteryx.com/t5/Public-Community-Gallery/PDF-Input/ta-p/887038 but there are a few pre-requisites before you can run it.
Another option is if you have the Intelligence Suite Licence - https://www.alteryx.com/products/intelligence-suite which has extraction from PDF capabilities
Hi @DavidSkaife
Thank you for the quick reply. It doesn't look like we have the Intelligence Suite License, so I'll take a crack at the macro solution first.
RDF
That macro looks like it uses R to parse the PDF. IMO R does not do a good job. I've worked with PDF files a number of times in production over the past couple of decades and the best free solution that I've found is the Xpdf command-line tools. These are no-frills exe files, and the results are better than R. There are multiple parsing options (check the --help switch). I've always had the best results using -layout and -table switches depending on how the document is formatted. Once converted, one would use regex and logic to parse and ensure no data is lost in conversion.
These are command-line parameters, so remember to wrap the path or file name in "quotes" if it contains a space. Test in the shell before putting it into Alteryx.
pdftotext.exe -layout file.pdf file.txt
or
pdftotext.exe -table file.pdf file.txt
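If you want to script the conversion before wiring it into Alteryx, here is a minimal Python sketch of the same two commands. It assumes pdftotext.exe is on your PATH; the function names (build_pdftotext_command, convert_pdf) are made up for illustration and are not part of Xpdf or Alteryx. Passing the arguments as a list means Python handles the quoting for paths with spaces.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, mode="layout", exe="pdftotext.exe"):
    """Build the argument list for pdftotext (hypothetical helper).

    mode is "layout" or "table", matching the -layout and -table switches
    mentioned above. exe is assumed to be on PATH; pass a full path otherwise.
    """
    if mode not in ("layout", "table"):
        raise ValueError("mode must be 'layout' or 'table'")
    # A list of arguments avoids manual quoting of paths that contain spaces.
    return [exe, f"-{mode}", str(pdf_path), str(txt_path)]


def convert_pdf(pdf_path, txt_path, mode="layout", exe="pdftotext.exe"):
    """Run pdftotext and return the path of the text file it wrote."""
    cmd = build_pdftotext_command(pdf_path, txt_path, mode, exe)
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(cmd, check=True)
    return Path(txt_path)
```

For example, convert_pdf("house builders.pdf", "builders.txt", mode="table") would run the second command above. Try both modes and compare the output, since which one works better depends on the document.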
To use with Alteryx, you'd just set up the run tool and read in the text results using either the run tool itself, or the blob tool in special cases if the results are mangled by the Alteryx tool. I have this deployed on the gallery at work.
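For the regex step, here is a rough Python sketch of pulling a company name and postcode out of the converted text. I haven't seen the actual PDF, so the layout is an assumption: it treats each builder as a block of lines separated by a blank line, with the company name on the first line. The postcode pattern is a simplified one that covers the common UK formats, not a full validator, and extract_builders is a made-up name. Records with no postcode come back with None so they can be checked by hand rather than silently dropped.

```python
import re

# Simplified UK postcode pattern (an approximation, not the official rules):
# 1-2 letters, a digit, an optional letter or digit, optional space, digit, 2 letters.
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}[0-9][A-Z0-9]?\s*[0-9][A-Z]{2}\b")


def extract_builders(text):
    """Return (company_name, postcode) pairs from pdftotext output.

    Assumes blank-line-separated blocks with the company name first;
    adjust the splitting to match what the real file looks like.
    """
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        # Search an upper-cased copy so lower-case postcodes are still found.
        match = POSTCODE_RE.search(block.upper())
        # Keep the record even without a postcode so no builder is lost.
        records.append((lines[0], match.group(0) if match else None))
    return records


if __name__ == "__main__":
    # Invented sample data, only to show the expected input shape.
    sample = (
        "Acme Homes Ltd\n1 High Street, Leeds LS1 4AB\n\n"
        "Brick & Co\nUnit 2, London SW1A 1AA\n"
    )
    for name, postcode in extract_builders(sample):
        print(name, postcode)
```

The same pattern should also work in the Alteryx RegEx tool if you would rather keep everything in the workflow.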