Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Extracting Tabular Data from PDF Documents with Python Code Tool

Alteryx

Hi @esridhar126, you will need to test that. I believe all tabular data should be read and parsed by the Python tool.

David Matyas
Sales Engineer
Alteryx
7 - Meteor

It pops up an error:

IndexError: index out of range.

I am not able to proceed with multiple tables. Any ideas would be appreciated.

Alteryx

Can you please share a sample PDF that you ran this against?

The sample Python script I shared only outputs the first table found in the document:

# Get the DataFrame from the first table found in the PDF
df = tables[0].df

So we could try to iterate through that data frame and output all the results at once.

I may not have time to do this within the next few days, so you may want to search for how to loop through the data frame and output all the results at once. It should be straightforward.

If that does not work, we may need to split the PDF into multiple pieces, in case this is not directly supported by the Camelot package.
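
One way to output all results at once is to stack every detected table into a single DataFrame. This is only a minimal sketch: it assumes `tables` is the list-like object returned by `camelot.read_pdf` and that each item exposes a `.df` attribute. The function name `combine_tables` and the `source_table` column are hypothetical names chosen for illustration. It also returns an empty DataFrame when nothing was detected, instead of failing the way `tables[0]` does on an empty result.

```python
import pandas as pd


def combine_tables(tables):
    """Stack every detected table into one DataFrame.

    `tables` is assumed to be list-like, with a `.df` attribute on each item.
    A `source_table` column (hypothetical name) records where each row came from.
    """
    frames = []
    for number, table in enumerate(tables, start=1):
        frame = table.df.copy()
        frame.insert(0, "source_table", number)
        frames.append(frame)
    if not frames:
        # Nothing was detected: return an empty frame instead of raising
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

Inside the Python tool, the combined frame could then be written to a single output with `Alteryx.write(combined, 1)`. Tables with different column counts will be padded with missing values by `pd.concat`.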

David Matyas
Sales Engineer
Alteryx
7 - Meteor

Sure, I will try this and update ASAP.

Thanks for your suggestions, they are very helpful.

Regards,
Sridhar

7 - Meteor

Hi @david,

I attached a PDF file with two tables; it does not work as expected.

Can you please help me solve this?

I got output after looping, but I am not able to store it as a CSV file with the tables in separate sheets.

I am doing this in Python.
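
A CSV file is plain text and has no concept of sheets, so one option is to write one CSV file per table. The following is a rough sketch under that assumption: it expects `tables` to be list-like with a `.df` DataFrame on each item (as Camelot returns), and `write_tables_to_csv`, `out_dir` and `stem` are hypothetical names used only for this example.

```python
from pathlib import Path


def write_tables_to_csv(tables, out_dir, stem="table"):
    """Write each table's DataFrame to its own CSV file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, table in enumerate(tables, start=1):
        path = out_dir / f"{stem}_{number}.csv"
        # Column labels are assumed to be positional (0, 1, 2, ...), so the
        # header row and the index are left out of the file
        table.df.to_csv(path, index=False, header=False)
        paths.append(path)
    return paths
```

If a single file with separate sheets is really needed, an Excel workbook is probably the better target; `pandas.ExcelWriter` with a different `sheet_name` per table would be one way to do that.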

Alteryx

@esridhar126 sure thing, let me check it out when I have a moment. Can you share the script you created for looping through the data frame?

David Matyas
Sales Engineer
Alteryx
7 - Meteor

Here is my code. It is the same code with a minor change (a loop).

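import camelot

tables = camelot.read_pdf('two2.pdf')

# Export every table in one call (outside the loop, so it runs only once)
tables.export('two2.csv', f='csv', compress=True)  # json, excel, html

# range() visits every index; the tuple (0, len(tables) - 1) only visits the first and last
for i in range(len(tables)):
    print(tables[i].parsing_report)
    tables[i].to_csv(f'two2_table_{i + 1}.csv')  # to_json, to_excel, to_html
    print(type(tables[i].df))  # get a pandas DataFrame!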

Alteryx

@esridhar126 something like the code below should help. Actually, the tables object behaves like a list, so the iteration is done differently.

This is simple code that goes through the tables object and writes every table to a different output: table 1 -> output 1, table 2 -> output 2, etc.

from ayx import Alteryx
import camelot

# Specify the path to your PDF document
tables = camelot.read_pdf('Z:/Google Drive/Alteryx/foo2.pdf')

# Write each table to its own tool output: table 1 -> output 1, table 2 -> output 2, etc.
for output_number, table in enumerate(tables, start=1):
    df = table.df  # get the DataFrame from the PDF table data
    Alteryx.write(df, output_number)

This is not exactly elegant, but I had a long day :-D

Also, you can do quite a bit just by specifying the pages to focus on directly in the read_pdf call:

camelot.read_pdf('your.pdf', pages='1,2,3')
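
Camelot reads only page 1 unless `pages` is given, so tables on later pages are skipped by default. The `pages` value is a comma-separated string, and `'all'` is also accepted. Below is a small sketch of building that string from a list of page numbers; `pages_argument` and `read_tables` are hypothetical helper names, not part of Camelot.

```python
def pages_argument(page_numbers=None):
    """Build the string passed as `pages` to camelot.read_pdf.

    With no page numbers, fall back to 'all' so every page is scanned.
    """
    if page_numbers is None:
        return "all"
    return ",".join(str(n) for n in sorted(set(page_numbers)))


def read_tables(pdf_path, page_numbers=None):
    """Read tables from the requested pages (hypothetical convenience wrapper)."""
    # Imported here so pages_argument() can be used without Camelot installed
    import camelot

    return camelot.read_pdf(pdf_path, pages=pages_argument(page_numbers))
```

For example, `read_tables('your.pdf', [1, 2, 3])` passes `pages='1,2,3'`, the same value as in the call above.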

 

David Matyas
Sales Engineer
Alteryx
7 - Meteor

I am getting this error. What should I do to clear it?

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-ec9eff74ca39> in <module>()
----> 1 from ayx import Alteryx

ModuleNotFoundError: No module named 'ayx'
Alteryx

Hi @esridhar126,

It seems something is wrong with loading packages (ayx is the default Alteryx package) from your local Python deployment of Miniconda (part of the Alteryx install folder).

Can you try running with elevated privileges?

Alternatively, can you add a new Python tool to a new workflow as a test and run it? Does that cause the same error?
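
To narrow this down, it may help to check which Python interpreter is actually running the code and whether it can see the `ayx` package at all. The snippet below is a quick diagnostic sketch using only the standard library; `describe_environment` is a hypothetical helper name. If the interpreter path does not point inside the Alteryx install folder, the code is probably running in a different Python environment where `ayx` is not installed.

```python
import importlib.util
import sys


def describe_environment(module_name="ayx"):
    """Report the running interpreter and whether `module_name` can be imported."""
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,
        "module": module_name,
        "available": spec is not None,
        "location": spec.origin if spec is not None else None,
    }


print(describe_environment())
```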

David Matyas
Sales Engineer
Alteryx