Over the past few weeks, our team has posted some pretty interesting (well, we hope anyway) articles on how to utilize the Python Code tool for parsing Words and PDFs.
In case you have missed any of those, check out How to use R and Python to Parse Word Documents and Parsing Text From PDF Documents with Python Code Tool
This article just piles on with yet another interesting feature to extend the endless possibilities of Alteryx Designer.
Intro
We are utilizing the Python Code tool within Alteryx Designer together with just recently announced Camelot package for Python to parse tabular data from PDFs.
Being able to parse text alone from PDFs is a great thing. It may come as a bit of a nightmare to try parsing tabular data from your PDF documents though. Camelot package and a workflow from this post should allow you overcome these issues.
On many occasions, I have seen customers storing PDFs generated by legacy reporting platforms that are still out there on many shared drives or SharePoints.
The PDF format has no internal representation of a table structure, which makes it difficult to extract tables for analysis. A lot of open data is stored in PDFs, which was not designed for tabular data in the first place.
Camelot, a Python library and command-line tool, makes it easy for anyone to extract data tables trapped inside PDF files.
You can check out the documentation at Read the Docs and follow the development on GitHub.
Solution
We use the Python Code tool with Camelot and Pandas package to extract tabular data from PDF. No "free" text is actually being parsed. The aim is just the tabular data.
There are actually two outputs from the Python tool. Output 1 with the actual tabular data. Output 2 with a report on how successful our processing was.
You can find the workflow at the bottom of this post together with the sample "foo.pdf" that contains some sample tabular data for testing.
Make sure that you read the installation notes at the bottom of this post.
Disclaimer: Camelot only works with text-based PDFs and not scanned documents. (If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based.)
Note2: Make sure you specify the path to your file in the Python tool. Should be easy. Shame on me, I could have wrapped this into a Macro tool like.
Also, the workflow may lose the code in the Python Code tool once you open it on your PC. This seems to be a bug in 2018-3 version of the Code tool...
If you have any problems opening it, just let me know and I can send you a workflow in ZIP file which seems to solve the problem.
Or you can just use the code from within the post to copy paste it into your tool.
Python Code
#Need the Alteryx package from ayx import Alteryx #Install Camelot Package for PDF tabular data parsing Alteryx.installPackages("camelot-py[all]")
#Parse the tabular data
import camelot #specify the path to your PDF document tables = camelot.read_pdf('//Mac/Google Drive/__Alteryx/foo.pdf') #Get the dataframe from the PDF table data df=tables[0].df #Write the dataframe with tabular data to the tool output number 1 Alteryx.write(df,1)
#Get the parsing report
import pandas #Get the parsing report parsing_report=tables[0].parsing_report #Turn the dictionary based parsing report into Pandas df df_parsing_report = pandas.DataFrame.from_dict(parsing_report,orient='index',columns=['Value']) #Assign values from Index to a new measure column df_parsing_report['Measure'] = df_parsing_report.index #Write the dataframe with parsing report to the tool output number 2 Alteryx.write(df_parsing_report,2)
Installation and Prerequisites
Make sure that you follow the instructions on how to properly install Camelot package for Python. In case you need it, there is also documentation at github project.
I suggest you install Python 3 to begin with. Then, you’ll need to install the dependencies for Camelot package, which include Tkinter and ghostscript.
These can be installed using your system’s package manager.
Tkinter should be packaged up with Python 3 installer, but you can also download it from ActiveTcl Community Edition from ActiveState.
For ghostscript, you can get the installer at the ghostscript downloads page.
After installing ghostscript, you’ll need to reboot. Seriously - do it! if you are like me you don't. But this is needed here and would have saved me a long long time.
Also, your system must have ghostscript executable’s path is in the windows PATH environment variable.
PATH to GS was not automatically registered in my case. Check the ENV variables for something like "C:\Program Files\gs\gs9.25\bin".
Verify successful installation of GS from command line interface:
C:\> gswin64c.exe -version
I would also suggest you run the Python Tool in Alteryx in interactive mode (i.e. by RUN button in Python Code tool directly).
For whatever reason, pip installation of Camelot package failed once or twice before succeeding. Probably some hiccup on dependencies.
Why Camelot?
Closing line
Hopefully, this will do the trick. Make sure you follow the install instructions above! Let me know how successful you were with deploying and running this tool.
Cheers!
hey @fgilbonio,
could you please send me a printscreen of your full screen with the error message from the python tool?
together with the full exception message you are getting from the tool (from within the Python tool window).
plus the code you are currently using within the Python tool - we have quite a few versions now within the post.
d
Hi!. Sure!
This is the full message in python tool windows
"Python (2) ---------------------------------------------------------------------------¶IndexError Traceback (most recent call last)¶<ipython-input-3-f63f3444c213> in <module>¶ 5 ¶ 6 #Get the dataframe from the PDF table data¶----> 7 df=tables[0].df¶ 8 ¶ 9 #Write the dataframe with tabular data to the tool output number 1¶c:\users\franco\appdata\local\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\camelot\core.py in __getitem__(self, idx)¶ 638 ¶ 639 def __getitem__(self, idx):¶--> 640 return self._tables[idx]¶ 641 ¶ 642 @staticmethod¶IndexError: list index out of range¶"
This is the full error message:
--------------------------------------------------------------------------- IndexError Traceback (most recent call last) <ipython-input-3-f63f3444c213> in <module> 5 6 #Get the dataframe from the PDF table data ----> 7 df=tables[0].df 8 9 #Write the dataframe with tabular data to the tool output number 1 c:\users\franco\appdata\local\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\camelot\core.py in __getitem__(self, idx) 638 639 def __getitem__(self, idx): --> 640 return self._tables[idx] 641 642 @staticmethod IndexError: list index out of range
This is all the python code
In[1]:
#Need the Alteryx package
from ayx import Alteryx
In[2]:
#Install Camelot Package for PDF tabular data parsing
Alteryx.installPackages("camelot-py[all]")
In[3]:
import camelot
#specify the path to your PDF document
tables = camelot.read_pdf('C:\\Pdf\\evale2.pdf')
#Get the dataframe from the PDF table data
df=tables[0].df
#Write the dataframe with tabular data to the tool output number 1
Alteryx.write(df,1)
In[4]:
import pandas
#Get the parsing report
parsing_report=tables[0].parsing_report
#Turn the dictionary based parsing report into Pandas df
df_parsing_report = pandas.DataFrame.from_dict(parsing_report,orient='index',columns=['Value'])
#Assign values from Index to a new measure column
df_parsing_report['Measure'] = df_parsing_report.index
#Write the dataframe with parsing report to the tool output number 2
Alteryx.write(df_parsing_report,2)
I shared some screenshots of tables in pdf file.
thks
@fgilbonio i think the problem is in the formatting of your PDF. the table is not being recognized as a table it seems.
check out the following link about how the python's Camelot package works and recognizes tables.
https://camelot-py.readthedocs.io/en/master/user/how-it-works.html
even when changing the mode of reading the pdf from lattice to stream did not work for me.
i don't think for such highly formatted table, where no lines in the actual table are used the package will work.
Hi everyone,
just adding one more bit that allows you to read all tables from the whole of your PDF document.
i have been frequently asked how to modify the code to read say 3, 5, 10 or more tables with same schema/ structure. here goes:
#from ayx import Alteryx
import camelot
import pandas as pd
#specify the path to your PDF document
#need to include param pages to go beyond page 1
tables = camelot.read_pdf('foo-more-tables.pdf', pages='1-end', flavor='lattice')
#Get the dataframe from the PDF table data
output_df=pd.DataFrame()
for table in tables:
#print(table.df)
output_df = output_df.append(table.df, ignore_index = True, sort = False)
print(output_df)
#Write the dataframe with tabular data to the tool output number 1
Alteryx.write(output_df,1)
Hi @DavidM ,
I get the following message when attempting to run the code from your latest post in this thread. Any suggestions on how can I resolve it?
c:\program files\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\camelot\ext\ghostscript\_gsprint.py in <module>
245 libgs = __win32_finddll()
246 if not libgs:
--> 247 raise RuntimeError("Please make sure that Ghostscript is installed")
248 libgs = windll.LoadLibrary(libgs)
249 else:
RuntimeError: Please make sure that Ghostscript is installed
@DavidM - Ignore my previous post, I re-read your original post and found the steps there.
Hi David, (@DavidM)
I am getting this error when running this script. I didn't have a problem with specifying the path to the file when using running PDF Text Parser but it seems to not work when running PDF Table Parser for some reason. Here is a screen shot of the error I get.
Hi @GilYee
I haven't used the Camelot package, and so not sure how to debug your error. Not sure if it helps your cause, but you could probably attempt using python's Tabula package to read tables in PDF. Below is a sample code for your reference:
from ayx import Package
from ayx import Alteryx
#Read from Alteryx workflow upstream
data = Alteryx.read("#1")
#Assign upstream data to a variable and give it a column name
url = data.iloc[0]["url"]
#Run this command once to install tabula package
#Package.installPackages(['tabula.py'])
import tabula
import pandas
#update the page nos. ('2-4') as per how your pdf file is structured.
df = tabula.read_pdf(url, pages = "2-4", pandas_options={'header':None})
Alteryx.write(df,1)
The output of this python code will be somewhat unstructured data of the pdf table, which you may have to clean downstream.
Hope this helps.