Extracting Tabular Data from PDF Documents with Python Code Tool

Alteryx

Hey @fgilbonio,

Could you please send me a screenshot of your full screen showing the error message from the Python tool, together with the full exception message you are getting from the tool (from within the Python tool window)?

Please also include the code you are currently using within the Python tool, since there are quite a few versions in this post by now.

 

d

David Matyas
Sales Engineer
Alteryx
6 - Meteoroid

Hi! Sure!

This is the full error message from the Python tool window:


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-3-f63f3444c213> in <module>
      5 
      6 #Get the dataframe from the PDF table data
----> 7 df=tables[0].df
      8 
      9 #Write the dataframe with tabular data to the tool output number 1

c:\users\franco\appdata\local\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\camelot\core.py in __getitem__(self, idx)
    638 
    639     def __getitem__(self, idx):
--> 640         return self._tables[idx]
    641 
    642     @staticmethod

IndexError: list index out of range

 

Alteryx
Hi,

Thanks for the input. Could you please also send me the Python script you are currently using? There are multiple versions in this post.

It seems your PDF contains quite a few tables and we simply overrun the number of possible outputs.

Do all the tables in the PDF share the same schema? Sorry, I can't open it due to security restrictions.

David Matyas
Sales Engineer
Alteryx
6 - Meteoroid

This is all the Python code:

 

In[1]:

#Need the Alteryx package
from ayx import Alteryx

 

In[2]:

#Install Camelot Package for PDF tabular data parsing
Alteryx.installPackages("camelot-py[all]")

 

In[3]:

import camelot

#specify the path to your PDF document
tables = camelot.read_pdf('C:\\Pdf\\evale2.pdf')

#Get the dataframe from the PDF table data
df=tables[0].df

#Write the dataframe with tabular data to the tool output number 1
Alteryx.write(df,1)

 

In[4]:

import pandas

#Get the parsing report
parsing_report=tables[0].parsing_report

#Turn the dictionary based parsing report into Pandas df
df_parsing_report = pandas.DataFrame.from_dict(parsing_report,orient='index',columns=['Value'])

#Assign values from Index to a new measure column
df_parsing_report['Measure'] = df_parsing_report.index

#Write the dataframe with parsing report to the tool output number 2
Alteryx.write(df_parsing_report,2)

 

 

I shared some screenshots of the tables in the PDF file.

Thanks

 

 

Alteryx

 

@fgilbonio I think the problem is the formatting of your PDF. It seems the table is not being recognized as a table, so Camelot returns an empty table list and tables[0] raises the IndexError.

Check out the following link on how Python's Camelot package works and recognizes tables:

https://camelot-py.readthedocs.io/en/master/user/how-it-works.html

Even changing the PDF reading mode from lattice to stream did not work for me.

I don't think the package will work for such a highly formatted table, where no ruling lines are used in the actual table.
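
For anyone hitting the same IndexError: it comes from indexing tables[0] when Camelot found no tables. Below is a minimal sketch of a guard that tries lattice first and then stream before giving up. The helper name read_first_table and its read_pdf argument are my own, made up for illustration (the argument only exists so the logic can be tried without a PDF); camelot.read_pdf with the pages and flavor parameters is the real call.

```python
def read_first_table(pdf_path, read_pdf=None, pages="1", flavors=("lattice", "stream")):
    """Return (flavor, dataframe) for the first table found, or (None, None).

    read_first_table and its read_pdf argument are hypothetical names for
    this sketch; by default the real camelot.read_pdf is used.
    """
    if read_pdf is None:
        import camelot  # imported lazily; assumes camelot-py is installed
        read_pdf = camelot.read_pdf

    for flavor in flavors:
        tables = read_pdf(pdf_path, pages=pages, flavor=flavor)
        # An empty result means no table was recognized with this flavor,
        # so indexing tables[0] here would raise IndexError.
        if len(tables) > 0:
            return flavor, tables[0].df

    return None, None
```

In the Python tool you would call read_first_table('C:\\Pdf\\evale2.pdf') and only pass the result to Alteryx.write when the returned dataframe is not None. For a PDF like the one above, both flavors may still come back empty, in which case the guard simply lets the workflow continue without the exception.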

 

David Matyas
Sales Engineer
Alteryx
Alteryx

Hi everyone,

 

Just adding one more bit that lets you read all tables from the whole of your PDF document.

I have frequently been asked how to modify the code to read, say, 3, 5, 10 or more tables with the same schema/structure. Here goes:

 

from ayx import Alteryx
import camelot
import pandas as pd

#specify the path to your PDF document
#need to include param pages to go beyond page 1
tables = camelot.read_pdf('foo-more-tables.pdf', pages='1-end', flavor='lattice')

#Collect the dataframe of every table found in the PDF
frames = [table.df for table in tables]

#Stack the tables into one dataframe (DataFrame.append was removed in pandas 2.0)
#An empty dataframe is used when no tables were found
output_df = pd.concat(frames, ignore_index=True, sort=False) if frames else pd.DataFrame()

print(output_df)

#Write the dataframe with tabular data to the tool output number 1
Alteryx.write(output_df,1)
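
The parsing report code earlier in the thread only looks at tables[0]. When reading many tables, it may help to get one report row per table as well. A rough sketch, assuming each table object exposes .df and .parsing_report as in the earlier code; the function name combine_tables is made up for this example:

```python
import pandas as pd

def combine_tables(tables):
    """Stack the rows of all tables and build one parsing-report row per table.

    combine_tables is a hypothetical helper; it expects objects with .df and
    .parsing_report attributes, such as the tables returned by camelot.read_pdf.
    """
    frames = []
    reports = []
    for table in tables:
        frames.append(table.df)
        reports.append(table.parsing_report)

    # pd.concat raises on an empty list, so fall back to an empty dataframe
    data = pd.concat(frames, ignore_index=True, sort=False) if frames else pd.DataFrame()
    # Each parsing report is a dictionary, so a list of them becomes one row per table
    report = pd.DataFrame(reports)
    return data, report
```

The two results could then go to separate outputs, for example Alteryx.write(data, 1) and Alteryx.write(report, 2), so a low-accuracy table is easy to spot downstream.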

 

David Matyas
Sales Engineer
Alteryx
Alteryx Certified Partner

Hi @DavidM ,

 

I get the following message when attempting to run the code from your latest post in this thread. Any suggestions on how I can resolve it?

 

c:\program files\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\camelot\ext\ghostscript\_gsprint.py in <module>
245 libgs = __win32_finddll()
246 if not libgs:
--> 247 raise RuntimeError("Please make sure that Ghostscript is installed")
248 libgs = windll.LoadLibrary(libgs)
249 else:

RuntimeError: Please make sure that Ghostscript is installed

Alteryx Certified Partner

@DavidM - Please ignore my previous post. I re-read your original post and found the steps there.

5 - Atom

Hi David, (@DavidM)

 

I am getting this error when running this script. I didn't have a problem specifying the path to the file when running the PDF Text Parser, but for some reason it does not seem to work when running the PDF Table Parser. Here is a screenshot of the error I get.

 

GilYee_0-1577546680517.png

Alteryx Certified Partner

Hi @GilYee

 

I haven't used the Camelot package, so I'm not sure how to debug your error. I don't know whether it helps your cause, but you could try Python's tabula-py package to read tables in a PDF. Below is some sample code for your reference:

 

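from ayx import Package
from ayx import Alteryx

#Read from Alteryx workflow upstream
data = Alteryx.read("#1")

#Take the PDF path or URL from the "url" column of the first incoming row
url = data.iloc[0]["url"]

#Run this command once to install the tabula package
#Package.installPackages(['tabula-py'])

import tabula
import pandas

#Update the page numbers ('2-4') to match how your PDF file is structured.
#Recent tabula-py versions return a list of dataframes (one per table),
#older versions return a single dataframe, so handle both cases
result = tabula.read_pdf(url, pages = "2-4", pandas_options={'header':None})
frames = result if isinstance(result, list) else [result]
df = pandas.concat(frames, ignore_index=True) if frames else pandas.DataFrame()

Alteryx.write(df,1)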

 

The output of this Python code will be somewhat unstructured data from the PDF table, which you may have to clean downstream.
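
If you would rather do a first pass of that cleanup inside the Python tool, something along these lines might be a starting point. It is only a sketch with pandas: the function name tidy_raw_table is made up, and which steps are appropriate depends on your PDF.

```python
import pandas as pd

def _strip_if_text(value):
    """Trim surrounding whitespace from text cells, leave other values alone."""
    return value.strip() if isinstance(value, str) else value

def tidy_raw_table(df):
    """Drop fully empty rows and columns, then trim whitespace in text cells.

    tidy_raw_table is a hypothetical helper for illustration only.
    """
    # Remove rows, then columns, that contain no values at all
    cleaned = df.dropna(how="all").dropna(axis=1, how="all")
    # Apply the trimming cell by cell, one column at a time
    cleaned = cleaned.apply(lambda col: col.map(_strip_if_text))
    return cleaned.reset_index(drop=True)
```

You would call it just before writing, for example Alteryx.write(tidy_raw_table(df), 1), and leave anything schema-specific (headers, data types, splitting columns) to the tools downstream.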

 

Hope this helps.
