Extract Data from PDF
Hi all -
Not sure if this can even be done. However....
I have a .pdf document with house builder data. Starting on page 55 there is a list of all house builders and their contact details. What I would like to do is extract each company name (shown in the blue bar) and return its full UK postcode.
Any help would be greatly appreciated.
RDF
Hi @RDF25087
There is a macro here - https://community.alteryx.com/t5/Public-Community-Gallery/PDF-Input/ta-p/887038 but there are a few pre-requisites before you can run it.
Another option, if you have the Intelligence Suite license (https://www.alteryx.com/products/intelligence-suite), is to use its PDF extraction capabilities.
Hi @DavidSkaife
Thank you for the quick reply. It doesn't look like we have the Intelligence Suite license, so I'll take a crack at the macro solution first.
RDF
That macro looks like it uses R to parse the PDF. IMO R does not do a good job. I've worked with PDF files a number of times in production over the past couple of decades and the best free solution that I've found is the Xpdf command-line tools. These are no-frills exe files, and the results are better than R. There are multiple parsing options (check the --help switch). I've always had the best results using -layout and -table switches depending on how the document is formatted. Once converted, one would use regex and logic to parse and ensure no data is lost in conversion.
These are command-line parameters, so remember to use "quotes" if there is a space in the path or file name. Test in the shell before putting it into Alteryx.
pdftotext.exe -layout file.pdf file.txt
or
pdftotext.exe -table file.pdf file.txt
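If you'd rather test the conversion from a script than type it in the shell, here is a minimal Python sketch of the same call. The function names, the default executable name and the `first_page` argument are my own assumptions for illustration; `-f` is pdftotext's first-page option, which might be handy given the list starts on page 55. Passing the arguments as a list means paths with spaces don't need manual quoting.

```python
import subprocess
from pathlib import Path


def build_pdftotext_cmd(pdf_path, txt_path, mode="layout", first_page=None,
                        exe="pdftotext"):
    """Build the argument list for pdftotext (hypothetical helper).

    mode is "layout" or "table", matching the two switches shown above.
    exe is assumed to be on PATH; pass a full path to pdftotext.exe otherwise.
    """
    if mode not in ("layout", "table"):
        raise ValueError("mode must be 'layout' or 'table'")
    cmd = [exe, f"-{mode}"]
    if first_page is not None:
        # -f <int> tells pdftotext which page to start converting from
        cmd += ["-f", str(first_page)]
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def convert(pdf_path, txt_path, **kwargs):
    """Run pdftotext and return the path of the text file it wrote."""
    cmd = build_pdftotext_cmd(pdf_path, txt_path, **kwargs)
    # A list of arguments is passed straight through, so spaces in
    # file names are safe without wrapping them in quotes.
    subprocess.run(cmd, check=True)
    return Path(txt_path)
```

Something like `convert("house builders.pdf", "builders.txt", mode="layout", first_page=55)` would then be the scripted equivalent of the first command above (the file names are placeholders).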
To use this with Alteryx, set up the Run Command tool and read in the text results, either with the Run Command tool itself or, in special cases where Alteryx mangles the results, with the Blob Input tool. I have this deployed on the Gallery at work.
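For the regex step on the converted text, here is a rough Python sketch of the idea. It rests on two assumptions that I can't verify without seeing the PDF: that each builder's entry comes out as a block separated by blank lines, and that the company name (the blue bar) is the first non-blank line of that block. The postcode pattern is a simplified one that covers the common UK formats, not a full validator. The same expression should carry over to the RegEx tool in Alteryx with little change.

```python
import re

# Simplified UK postcode pattern: outward code (e.g. LS1, SW1A),
# optional whitespace, inward code (e.g. 4AP). Not a strict validator.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})\b")


def extract_entries(text):
    """Return (company_name, postcode) pairs from pdftotext output.

    Assumes one entry per blank-line-separated block, with the
    company name on the first non-blank line (an assumption about
    the layout, adjust to match the real file).
    """
    entries = []
    for block in re.split(r"\n\s*\n", text):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        # Upper-case the block so lower-case postcodes are still found
        match = POSTCODE_RE.search(block.upper())
        if match:
            # Normalise to the usual "OUTWARD INWARD" form with one space
            postcode = f"{match.group(1)} {match.group(2)}"
            entries.append((lines[0], postcode))
    return entries
```

Blocks without a postcode are skipped here, so it is worth comparing the count of entries against the count of blocks to confirm no data was lost in conversion.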