This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
This article just piles on with yet another interesting feature to extend the endless possibilities of Alteryx Designer.
We are utilizing the Python Code tool within Alteryx Designer together with just recently announced Camelot packagefor Python to parse tabular data from PDFs.
Being able to parse text alone from PDFs is a great thing. It may come as a bit of a nightmare to try parsing tabular data from your PDF documents though. Camelot package and a workflow from this post should allow you overcome these issues.
On many occasions, I have seen customers storing PDFs generated by legacy reporting platforms that are still out there on many shared drives or SharePoints.
The PDF format has no internal representation of a table structure, which makes it difficult to extract tables for analysis. A lot of open data is stored in PDFs, which was not designed for tabular data in the first place.
Camelot, a Python library and command-line tool, makes it easy for anyone to extract data tables trapped inside PDF files.
We use the Python Code tool with Camelot and Pandas package to extract tabular data from PDF. No "free" text is actually being parsed. The aim is just the tabular data.
There are actually two outputs from the Python tool. Output 1 with the actual tabular data. Output 2 with a report on how successful our processing was.
You can find the workflow at the bottom of this post together with the sample "foo.pdf" that contains some sample tabular data for testing.
Make sure that you read the installation notes at the bottom of this post.
Disclaimer: Camelot only works with text-based PDFs and not scanned documents. (If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based.)
Note2: Make sure you specify the path to your file in the Python tool. Should be easy. Shame on me, I could have wrapped this into a Macro tool like.
Also, the workflow may lose the code in the Python Code tool once you open it on your PC. This seems to be a bug in 2018-3version of the Code tool...
If you have any problems opening it, just let me know and I can send you a workflow inZIPfile which seems to solve the problem.
Or you can just use the code from within the post to copy paste it into your tool.
#Need the Alteryx package
from ayx import Alteryx
#Install Camelot Package for PDF tabular data parsing
#Parse the tabular data
#specify the path to your PDF document
tables = camelot.read_pdf('//Mac/Google Drive/__Alteryx/foo.pdf')
#Get the dataframe from the PDF table data
#Write the dataframe with tabular data to the tool output number 1
#Get the parsing report
#Get the parsing report
#Turn the dictionary based parsing report into Pandas df
df_parsing_report = pandas.DataFrame.from_dict(parsing_report,orient='index',columns=['Value'])
#Assign values from Index to a new measure column
df_parsing_report['Measure'] = df_parsing_report.index
#Write the dataframe with parsing report to the tool output number 2
After installing ghostscript, you’ll need to reboot. Seriously - do it! if you are like me you don't. But this is needed here and would have saved me a long long time.
Also, your system must have ghostscript executable’s path is in the windows PATH environment variable.
PATH to GS was not automatically registered in my case. Check the ENV variables for something like "C:\Program Files\gs\gs9.25\bin".
Verify successful installation of GS from command line interface:
I would also suggest you run the Python Tool in Alteryx in interactive mode (i.e. by RUN button in Python Code tool directly).
For whatever reason, pip installation of Camelot package failed once or twice before succeeding. Probably some hiccup on dependencies.
You are in control.: Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
Badtables can be discarded based onmetricslike accuracy and whitespace, without ever having to manually look at each table.
Each table is apandas DataFrame, which seamlessly integrates with Designer
Exportto multiple formats, including JSON, Excel, and HTML
Hopefully, this will do the trick. Make sure you follow the install instructions above! Let me know how successful you were with deploying and running this tool.
@DavidM You are right, I am having some trouble with the installs. But not exactly the trouble I expected.
The error I am getting is "ImportError: Missing required dependencies ['numpy']" However it looks as if it is trying to load the required dependencies from the site-packages folder within an Alteryx folder that was updated with 2018.3.
I have Pandas, and Numpy installed through pip on this machine and I confirmed they are up to date. It just seems as if Alteryx is looking for these in the Alteryx site-packages folder, and can't find them.
@DavidM Ya, after I posted I started digging into the specifics of the Python tool, tried to use Alteryx.installPackages("numpy") and that gave me a strange error too. Thought that was as good of a sign to do a fresh install as any. Ended up reinstalling 2018.3 and that solved it.
Great write up!! Your instructions were very clear, I just had a weird glitch with the Python pre-installed packages.
@DavidM I think this will work for me. I just needed to read through a bit more of the Camelot documentation(which is great) and change the input method to 'stream' because by default Camelot is set to 'Lattice'. Need to adjust a couple of other "settings" so to speak and we're off running!
Camelot is a robust module, and being able to use this within Alteryx is really amazing.