We’ve extended Inspire Early Bird Pricing until March 1. Register now and enjoy 20% off conference passes and 10% off training passes. P.S. Don’t forget to bring friends! When you sign up for five or more tickets, you get an extra 20% discount on conference passes. Learn more now.

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Extracting Tabular Data from PDF Documents with Python Code Tool

DavidM
Alteryx
Alteryx

Over the past few weeks, our team has posted some pretty interesting (well, we hope anyway) articles on how to utilize the Python Code tool for parsing Words and PDFs.

In case you have missed any of those, check out How to use R and Python to Parse Word Documents and Parsing Text From PDF Documents with Python Code Tool

 

This article just piles on with yet another interesting feature to extend the endless possibilities of Alteryx Designer.

 

Intro

We are utilizing the Python Code tool within Alteryx Designer together with just recently announced Camelot package for Python to parse tabular data from PDFs.

Being able to parse text alone from PDFs is a great thing. It may come as a bit of a nightmare to try parsing tabular data from your PDF documents though. Camelot package and a workflow from this post should allow you overcome these issues.

 

On many occasions, I have seen customers storing PDFs generated by legacy reporting platforms that are still out there on many shared drives or SharePoints.

 

The PDF format has no internal representation of a table structure, which makes it difficult to extract tables for analysis. A lot of open data is stored in PDFs, which was not designed for tabular data in the first place.

 

Camelot, a Python library and command-line tool, makes it easy for anyone to extract data tables trapped inside PDF files.

You can check out the documentation at Read the Docs and follow the development on GitHub.

 

image.png

 

Solution

We use the Python Code tool with Camelot and Pandas package to extract tabular data from PDF. No "free" text is actually being parsed. The aim is just the tabular data.

 

There are actually two outputs from the Python tool. Output 1 with the actual tabular data. Output 2 with a report on how successful our processing was.

 

image.png

 

You can find the workflow at the bottom of this post together with the sample "foo.pdf" that contains some sample tabular data for testing.

Make sure that you read the installation notes at the bottom of this post.

 

Disclaimer: Camelot only works with text-based PDFs and not scanned documents. (If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based.)

 

Note2: Make sure you specify the path to your file in the Python tool. Should be easy. Shame on me, I could have wrapped this into a Macro tool like. 

Also, the workflow may lose the code in the Python Code tool once you open it on your PC. This seems to be a bug in 2018-3 version of the Code tool...

 

If you have any problems opening it, just let me know and I can send you a workflow in ZIP file which seems to solve the problem.

Or you can just use the code from within the post to copy paste it into your tool.

  

image.png

 

Python Code

 

#Need the Alteryx package
from ayx import Alteryx

#Install Camelot Package for PDF tabular data parsing
Alteryx.installPackages("camelot-py[all]")

 

#Parse the tabular data

import camelot #specify the path to your PDF document tables = camelot.read_pdf('//Mac/Google Drive/__Alteryx/foo.pdf') #Get the dataframe from the PDF table data df=tables[0].df #Write the dataframe with tabular data to the tool output number 1 Alteryx.write(df,1)
#Get the parsing report


import pandas #Get the parsing report parsing_report=tables[0].parsing_report #Turn the dictionary based parsing report into Pandas df df_parsing_report = pandas.DataFrame.from_dict(parsing_report,orient='index',columns=['Value']) #Assign values from Index to a new measure column df_parsing_report['Measure'] = df_parsing_report.index #Write the dataframe with parsing report to the tool output number 2 Alteryx.write(df_parsing_report,2)

Installation and Prerequisites

Make sure that you follow the instructions on how to properly install Camelot package for Python. In case you need it, there is also documentation at github project.

 

I suggest you install Python 3 to begin with. Then, you’ll need to install the dependencies for Camelot package, which include Tkinter and ghostscript.

These can be installed using your system’s package manager. 

 

Tkinter should be packaged up with Python 3 installer, but you can also download it from ActiveTcl Community Edition from ActiveState.

For ghostscript, you can get the installer at the ghostscript downloads page.

 

After installing ghostscript, you’ll need to reboot. Seriously - do it! if you are like me you don't. But this is needed here and would have saved me a long long time.

Also, your system must have ghostscript executable’s path is in the windows PATH environment variable. 

 

PATH to GS was not automatically registered in my case. Check the ENV variables for something like "C:\Program Files\gs\gs9.25\bin".

Verify successful installation of GS from command line interface: 

C:\> gswin64c.exe -version

I would also suggest you run the Python Tool in Alteryx in interactive mode (i.e. by RUN button in Python Code tool directly). 

For whatever reason, pip installation of Camelot package failed once or twice before succeeding. Probably some hiccup on dependencies.

 

Why Camelot?

  • You are in control.: Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
  • Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
  • Each table is a pandas DataFrame, which seamlessly integrates with Designer
  • Export to multiple formats, including JSON, Excel, and HTML

 

Closing line

Hopefully, this will do the trick. Make sure you follow the install instructions above! Let me know how successful you were with deploying and running this tool.

 

Cheers!

David Matyas
Sales Engineer
Alteryx
65 REPLIES 65
DavidM
Alteryx
Alteryx

Hi @GilYee,

 

Regarding the ghostscript exception you are getting please refer back to the original post:

 

"For ghostscript, you can get the installer at the ghostscript downloads page."

 

"After installing ghostscript, you’ll need to reboot. Seriously - do it! if you are like me you don't. But this is needed here and would have saved me a long long time.

Also, your system must have ghostscript executable’s path is in the windows PATH environment variable."

 

Cheers,

david

David Matyas
Sales Engineer
Alteryx
fajar_wimar
6 - Meteoroid

Hi All, I have made PDF to Tabular macros based on this post which can read multiple pdfs into tabular data.

https://community.alteryx.com/t5/Alteryx-Designer-Discussions/PDF-to-Tabular/td-p/553429

Thank you.

fgranger
5 - Atom

Hi David,

 

Thanks for this script and your explanations. I have used this as a starting point, and I am working on a macro which feeds in parameters. I am struggling with passing in the table_areas and columns parameters, and I was wondering if you could provide any guidance, please.

 

Here is the "hardcoded" starting point, based on your script:

 

#HARDCODED PDF IMPORTER

from ayx import Alteryx

import camelot
import pandas as pd

tables = camelot.read_pdf(
    '[PATH TO PDF].pdf',
    password='',
    flavor='stream',
    row_tol=5,
    pages='all',
    table_areas=[
        '0,600,800,550',
        '0,550,800,470',
        '0,470,800,40',
        '0,40,800,0'
    ],
    columns=[
        '0,200,600,730',
        '0,60,330,440,500,600,700',
        '0,60,330,440,500,600,700',
        ''
    ]
)

output_df=pd.DataFrame()

for table in tables:
    output_df = output_df.append(table.df, ignore_index = True, sort = False)
print(output_df)

 

This functions as expected. As you can see, each PDF page has 4 table areas, and those table areas have different column schema.

 

I have tried to make this more flexible and user-friendly by driving this with interface tools as part of a macro:

 

#PDF IMPORTER WITH ALTERYX INPUTS

from ayx import Alteryx

import camelot
import pandas as pd

#Bring in Alteryx data connections
pdf_file = Alteryx.read("#1")['Value'].values[0];
password = Alteryx.read("#2")['Value'].values[0];
flavor = Alteryx.read("#3")['Value'].values[0];
pages = Alteryx.read("#4")['Value'].values[0];
row_tol = Alteryx.read("#5")['Value'].values[0];
table_areas = Alteryx.read("#6")['Value'].values[0];
columns = Alteryx.read("#7")['Value'].values[0]

tables = camelot.read_pdf(
    pdf_file,
    password=password,
    flavor=flavor,
    row_tol=row_tol,
    pages=pages,
    table_areas=table_areas,
    columns=columns
    ]
)

output_df=pd.DataFrame()

for table in tables:
    output_df = output_df.append(table.df, ignore_index = True, sort = False)
print(output_df)

 

Everything works apart from the table_areas and columns elements. I feel that this may be due to the treatment of Python lists. I have tried a simplified version with just 1 table area and its related columns (rather than 4 table areas), but that doesn't seem to work.

 

I have included the input tables below:

 

Input #1

RecordDescriptionParameterValue
1PDF file namepdf_file[PATH TO PDF].pdf

 

Input #2

RecordDescriptionParameterValue
1Passwordpassword[Null]

 

Input #3

RecordDescriptionParameterValue
1Flavorflavorstream

 

Input #4

RecordDescriptionParameterValue
1Page rangepagesall

 

Input #5

RecordDescriptionParameterValue
1Row tolerancerow_tol5

 

Input #6

I am struggling with this. I have tried both collapsing this into a string (using Summarize tools) and creating a list of lists, but neither seem to pass into Camelot correctly. Here is the raw data I am trying to feed in:

RecordParameterTable numberTable area elementValue
1table_areas1top-left x10
2table_areas1top-left y1600
3table_areas1bottom-right x2800
4table_areas1bottom-right y2550
5table_areas2top-left x10
6table_areas2top-left y1550
7table_areas2bottom-right x2800
8table_areas2bottom-right y2470
9table_areas3top-left x10
10table_areas3top-left y1470
11table_areas3bottom-right x2800
12table_areas3bottom-right y240
13table_areas4top-left x10
14table_areas4top-left y140
15table_areas4bottom-right x2800
16table_areas4bottom-right y20

 

Input #7

I am having the same issues as in Input #6 above. Here is the raw column data:

RecordParameterTable numberColumn numberValue
1

columns

110
2columns12200
3columns13600
4columns14730
5columns210
6columns2260
7columns23330
8columns24440
9columns25500
10columns26600
11columns27700
12columns310
13columns3260
14columns33330
15columns34440
16columns35500
17columns36600
18columns37700
19columns4[Null][Null]

 

Any ideas would be appreciated, as being able to drive the table areas more flexibly would make this much more powerful.

trettelap
8 - Asteroid

still troubleshooting but getting an error "no module named camelot".....installed using admin..any thoughts?

aashay511
7 - Meteor

Hi @DavidM 

 

I am trying to use the Camelot package inside the python tool in Alteryx. I have the required dependencies - Ghostscript and Tkinter. 

 

I am getting the following error when using this package: 

 

OSError                                   Traceback (most recent call last)
<ipython-input-5-f57fc83b89cc> in <module>
----> 1 tables = camelot.read_pdf(r"C:\Users\user\Downloads\Equipment_List.pdf", pages='2')
      2 
      3 #Get the dataframe from the PDF table data
      4 print(tables)
      5 

c:\program files\alteryx\bin\miniconda3\envs\designerbasetools_venv\lib\site-packages\camelot\io.py in read_pdf(filepath, pages, password, flavor, suppress_stdout, layout_kwargs, **kwargs)
    111         p = PDFHandler(filepath, pages=pages, password=password)
    112         kwargs = remove_extra(kwargs, flavor=flavor)
--> 113         tables = p.parse(    114             flavor=flavor,
    115             suppress_stdout=suppress_stdout,

c:\program files\alteryx\bin\miniconda3\envs\designerbasetools_venv\lib\site-packages\camelot\handlers.py in parse(self, flavor, suppress_stdout, layout_kwargs, **kwargs)
    174             parser = Lattice(**kwargs) if flavor == "lattice" else Stream(**kwargs)
    175             for p in pages:
--> 176                 t = parser.extract_tables(    177                     p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
    178                 )

c:\program files\alteryx\bin\miniconda3\envs\designerbasetools_venv\lib\site-packages\camelot\parsers\lattice.py in extract_tables(self, filename, suppress_stdout, layout_kwargs)
    419             return []
    420 
--> 421         self.backend.convert(self.filename, self.imagename)
    422 
    423         self._generate_table_bbox()

c:\program files\alteryx\bin\miniconda3\envs\designerbasetools_venv\lib\site-packages\camelot\backends\ghostscript_backend.py in convert(self, pdf_path, png_path, resolution)
     29     def convert(self, pdf_path, png_path, resolution=300):
     30         if not self.installed():
---> 31             raise OSError(     32                 "Ghostscript is not installed. You can install it using the instructions"
     33                 " here: https://camelot-py.readthedocs.io/en/master/user/install-deps.html"

OSError: Ghostscript is not installed. You can install it using the instructions here: https://camelot-py.readthedocs.io/en/master/user/install-deps.html

 

I tried to use the following 2 methods to see whether GhostScript is installed properly or not. 

 

1. 

PS C:\Users\user> gswin32c.exe -version
GPL Ghostscript 9.55.0 (2021-09-27)
Copyright (C) 2021 Artifex Software, Inc. All rights reserved.

 

2. 

import ctypes
from ctypes.util import find_library
find_library("".join(("gsdll", str(ctypes.sizeof(ctypes.c_voidp) * 8), ".dll")))

- This returned nothing

 

I also have the bin folder location of ghostscript set in the path variable. 

 

What other steps could I take to ensure this starts working?

NeilFisk
9 - Comet

David,

 

I tried to run the workflow and get an error on the path of the file.  I have entered what I believe to be the correct path.

 

Code is:

#################################
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
Package.installPackages(['pandas','numpy','camelot'])


#################################
from ayx import Alteryx
#from ayx import Alteryx
import camelot
import pandas as pd

#specify the path to your PDF document
#need to include param pages to go beyond page 1
tables = camelot.read_pdf('C:\Users\nfisk\Desktop\foo.pdf', pages='1-end', flavor='lattice')

#Get the dataframe from the PDF table data
output_df=pd.DataFrame()

for table in tables:
#print(table.df)
output_df = output_df.append(table.df, ignore_index = True, sort = False)

print(output_df)

#Write the dataframe with tabular data to the tool output number 1
Alteryx.write(output_df,1)

 

Error I receive is:

  File "<ipython-input-2-52acfbdb4c1a>", line 8
    tables = camelot.read_pdf('C:\Users\nfisk\Desktop\foo.pdf', pages='1-end', flavor='lattice')
                              ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

Any help would be appreciated.

 

Regards,

Neil

 

Labels