Advent of Code is back! Unwrap daily challenges to sharpen your Alteryx skills and earn badges along the way! Learn more now.

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Extracting Tabular Data from PDF Documents with Python Code Tool

DavidM
Alteryx Alumni (Retired)

Over the past few weeks, our team has posted some pretty interesting (well, we hope anyway) articles on how to utilize the Python Code tool for parsing Words and PDFs.

In case you have missed any of those, check out How to use R and Python to Parse Word Documents and Parsing Text From PDF Documents with Python Code Tool

 

This article just piles on with yet another interesting feature to extend the endless possibilities of Alteryx Designer.

 

Intro

We are utilizing the Python Code tool within Alteryx Designer together with just recently announced Camelot package for Python to parse tabular data from PDFs.

Being able to parse text alone from PDFs is a great thing. It may come as a bit of a nightmare to try parsing tabular data from your PDF documents though. Camelot package and a workflow from this post should allow you overcome these issues.

 

On many occasions, I have seen customers storing PDFs generated by legacy reporting platforms that are still out there on many shared drives or SharePoints.

 

The PDF format has no internal representation of a table structure, which makes it difficult to extract tables for analysis. A lot of open data is stored in PDFs, which was not designed for tabular data in the first place.

 

Camelot, a Python library and command-line tool, makes it easy for anyone to extract data tables trapped inside PDF files.

You can check out the documentation at Read the Docs and follow the development on GitHub.

 

image.png

 

Solution

We use the Python Code tool with Camelot and Pandas package to extract tabular data from PDF. No "free" text is actually being parsed. The aim is just the tabular data.

 

There are actually two outputs from the Python tool. Output 1 with the actual tabular data. Output 2 with a report on how successful our processing was.

 

image.png

 

You can find the workflow at the bottom of this post together with the sample "foo.pdf" that contains some sample tabular data for testing.

Make sure that you read the installation notes at the bottom of this post.

 

Disclaimer: Camelot only works with text-based PDFs and not scanned documents. (If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based.)

 

Note2: Make sure you specify the path to your file in the Python tool. Should be easy. Shame on me, I could have wrapped this into a Macro tool like. 

Also, the workflow may lose the code in the Python Code tool once you open it on your PC. This seems to be a bug in 2018-3 version of the Code tool...

 

If you have any problems opening it, just let me know and I can send you a workflow in ZIP file which seems to solve the problem.

Or you can just use the code from within the post to copy paste it into your tool.

  

image.png

 

Python Code

 

#Need the Alteryx package
from ayx import Alteryx

#Install Camelot Package for PDF tabular data parsing
Alteryx.installPackages("camelot-py[all]")

 

#Parse the tabular data

import camelot #specify the path to your PDF document tables = camelot.read_pdf('//Mac/Google Drive/__Alteryx/foo.pdf') #Get the dataframe from the PDF table data df=tables[0].df #Write the dataframe with tabular data to the tool output number 1 Alteryx.write(df,1)
#Get the parsing report


import pandas #Get the parsing report parsing_report=tables[0].parsing_report #Turn the dictionary based parsing report into Pandas df df_parsing_report = pandas.DataFrame.from_dict(parsing_report,orient='index',columns=['Value']) #Assign values from Index to a new measure column df_parsing_report['Measure'] = df_parsing_report.index #Write the dataframe with parsing report to the tool output number 2 Alteryx.write(df_parsing_report,2)

Installation and Prerequisites

Make sure that you follow the instructions on how to properly install Camelot package for Python. In case you need it, there is also documentation at github project.

 

I suggest you install Python 3 to begin with. Then, you’ll need to install the dependencies for Camelot package, which include Tkinter and ghostscript.

These can be installed using your system’s package manager. 

 

Tkinter should be packaged up with Python 3 installer, but you can also download it from ActiveTcl Community Edition from ActiveState.

For ghostscript, you can get the installer at the ghostscript downloads page.

 

After installing ghostscript, you’ll need to reboot. Seriously - do it! if you are like me you don't. But this is needed here and would have saved me a long long time.

Also, your system must have ghostscript executable’s path is in the windows PATH environment variable. 

 

PATH to GS was not automatically registered in my case. Check the ENV variables for something like "C:\Program Files\gs\gs9.25\bin".

Verify successful installation of GS from command line interface: 

C:\> gswin64c.exe -version

I would also suggest you run the Python Tool in Alteryx in interactive mode (i.e. by RUN button in Python Code tool directly). 

For whatever reason, pip installation of Camelot package failed once or twice before succeeding. Probably some hiccup on dependencies.

 

Why Camelot?

  • You are in control.: Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
  • Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
  • Each table is a pandas DataFrame, which seamlessly integrates with Designer
  • Export to multiple formats, including JSON, Excel, and HTML

 

Closing line

Hopefully, this will do the trick. Make sure you follow the install instructions above! Let me know how successful you were with deploying and running this tool.

 

Cheers!

David Matyas
Sales Engineer
Alteryx
65 REPLIES 65
DavidM
Alteryx Alumni (Retired)

hi @esridhar126 i think that you may need to ask your IT to add you full permissions in WINDOWS on ALTERYX folder including all the%alteryx%\bin\miniconda3\pythontool_venv and its subdirectories/ files

David Matyas
Sales Engineer
Alteryx
tochy
8 - Asteroid

Awesome! But in my case, I want to extract from several pdf files in one directory and these were my steps which didn't work.

 

I used the directory tool, and used a wild card but this didnt work. 

 

How do I go about this?

parsepdf.JPGparse pdf2.JPG

DavidM
Alteryx Alumni (Retired)

Hi @tochy,

 

I would suggest you create a batch macro which contains that Python tool reading PDF inputs.

 

A control parameter would be used to reconfigure the macro every time for every single PDF file you are trying to read.

 

You can refer to 

 

https://community.alteryx.com/t5/Alteryx-Knowledge-Base/The-Ultimate-Input-Data-Flowchart/ta-p/20480

 

and

https://www.youtube.com/watch?v=YIAbQGQ_Hkg

 

cheers,

d

 

 

David Matyas
Sales Engineer
Alteryx
A batch macro is a special kind of macro that is typically needed to process a group of records based on a control parameter. The control parameter determines which group of records will be processed through the underlying macro logic. The macro will be run from beginning to end for each control ...
tochy
8 - Asteroid

@DavidM 

 

I have an 87 page document and each page contains a table. I tried to use the iteration below but it keeps only extracting the table on the first page. Any ideas?

DavidM
Alteryx Alumni (Retired)
Hi,

Is the 80 pages all with the same pdf schema?

Does is just dont go beyond page 1 even on shorter docs that have the same schema?

Did you try the improvement of the code i suggested a few posts back on reading multiple tables?

David Matyas | Sales Engineer
Alteryx Prague, Czech Republic
Mobile: +420 725 919 975<>
Email: dmatyas@alteryx.com | www.alteryx.com<>


[cid:E061FBA7B0134CE496FF8A76CE7153A7]
David Matyas
Sales Engineer
Alteryx
tochy
8 - Asteroid

Thanks David for your time.

 

Is the 80 pages all with the same pdf schema?

83 of the 87 pages have the same schema. The pdf only contain tables.

Does is just dont go beyond page 1 even on shorter docs that have the same schema?

Yes.

Did you try the improvement of the code i suggested a few posts back on reading multiple tables?

Yes, but it still reads only page 1.

DavidM
Alteryx Alumni (Retired)
Cheers. Can you share some document where i can test this please?

David Matyas | Sales Engineer
Alteryx Prague, Czech Republic
Mobile: +420 725 919 975<>
Email: dmatyas@alteryx.com | www.alteryx.com<>


[cid:50BF7AC2AAF54A099505720ADED88A6A]
David Matyas
Sales Engineer
Alteryx
tochy
8 - Asteroid

I have sent you an email. Let me know what you think.

 

Thanks a bunch!

DavidM
Alteryx Alumni (Retired)

Hi @tochy,

 

Yeah i think i have it. I could not test this on your pdf (got filtered out by a spam filter) but tested on my foo.pdf with multiple tables across multiple pages.

 

There was a need to loop through that tables list + actually specify pages range in that camelot.read_pdf call.

 

Without those pages spec it just did not work.

 

Something like this should fix the problem

#Parse the tabular data

import camelot

#specify the path to your PDF document
#need to include param pages to go beyond page 1
tables = camelot.read_pdf('foo-more-tables.pdf', pages='1-2')  

#Get the dataframe from the PDF table data
output_number = 1

#Loop through the tables and output all of them
for table in tables:
   df = table.df
   #print(df)
   output_number+=1 
    
#Write the dataframe with tabular data to the tool output number 1
#Alteryx.write(df,1)

 

And get you something like this

 

image.png

 

From a PDF like this

 

image.png

 

 

David Matyas
Sales Engineer
Alteryx
NBart
6 - Meteoroid

And if you just want your program to run through all pages without specifying page number you can replace the last page # with 'end':

 

tables = camelot.read_pdf('foo-more-tables.pdf', pages='1-end')

 

 

Labels