
Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Extracting Tabular Data from PDF Documents with Python Code Tool

Alteryx

Hi @esridhar126, I think you may need to ask your IT team to grant you full permissions in Windows on the Alteryx folder, including %alteryx%\bin\miniconda3\pythontool_venv and all of its subdirectories and files.

David Matyas
Sales Engineer
Alteryx
Asteroid

Awesome! But in my case, I want to extract tables from several PDF files in one directory, and the steps I tried (shown in the screenshots) didn't work.

I used the Directory tool with a wildcard, but that didn't work.

How do I go about this?

[Attached screenshots: parsepdf.JPG, parse pdf2.JPG]

Alteryx

Hi @tochy,

 

I would suggest you create a batch macro that contains the Python tool reading the PDF input.

A control parameter would then reconfigure the macro for each PDF file you are trying to read, so the macro runs once per file.

 

You can refer to

https://community.alteryx.com/t5/Alteryx-Knowledge-Base/The-Ultimate-Input-Data-Flowchart/ta-p/20480

and

https://www.youtube.com/watch?v=YIAbQGQ_Hkg
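
As a code-only alternative to the batch macro (a different technique from the one suggested in this post), the loop over files could also live inside the Python tool itself. The following is a minimal sketch, assuming all PDFs sit directly in one folder. `find_pdfs`, `read_all_pdfs`, and the folder path in the usage comment are hypothetical names, and the reader is passed in as a parameter so that `camelot.read_pdf` can be plugged in:

```python
from pathlib import Path


def find_pdfs(directory):
    """Return the PDF files directly inside `directory`, sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def read_all_pdfs(directory, reader):
    """Call `reader(path_string)` for every PDF and collect (file name, result) pairs."""
    results = []
    for pdf_path in find_pdfs(directory):
        results.append((pdf_path.name, reader(str(pdf_path))))
    return results


# Hypothetical usage inside the Python tool (needs camelot installed):
# import camelot
# parsed = read_all_pdfs(r"C:\data\pdfs", lambda p: camelot.read_pdf(p, pages="1-end"))
# for file_name, tables in parsed:
#     print(file_name, len(tables))
```

Keeping the file name next to each result makes it possible to tell afterwards which PDF a table came from.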

 

cheers,

d

 

 

David Matyas
Sales Engineer
Alteryx
Asteroid

@DavidM 

 

I have an 87-page document, and each page contains a table. I tried the iteration below, but it keeps extracting only the table on the first page. Any ideas?

Alteryx
Hi,

Do all 80 pages share the same PDF schema?

Does it fail to go beyond page 1 even on shorter documents that have the same schema?

Did you try the improvement to the code I suggested a few posts back for reading multiple tables?

David Matyas
Sales Engineer
Alteryx
Asteroid

Thanks for your time, David.

 

Do all 80 pages share the same PDF schema?

83 of the 87 pages have the same schema. The PDF contains only tables.

Does it fail to go beyond page 1 even on shorter documents that have the same schema?

Yes.

Did you try the improvement to the code I suggested a few posts back for reading multiple tables?

Yes, but it still reads only page 1.

Alteryx
Cheers. Can you please share a document that I can test this on?

David Matyas
Sales Engineer
Alteryx
Asteroid

I have sent you an email. Let me know what you think.

 

Thanks a bunch!

Alteryx

Hi @tochy,

 

Yeah, I think I have it. I could not test this on your PDF (it got caught by a spam filter), but I tested on my foo.pdf with multiple tables across multiple pages.

Two things were needed: looping through the tables list, and actually specifying the page range in the camelot.read_pdf call.

Without the pages spec, camelot reads only page 1, so it just did not work.

Something like this should fix the problem:

#Parse the tabular data

from ayx import Alteryx
import camelot

#specify the path to your PDF document
#need to include param pages to go beyond page 1
tables = camelot.read_pdf('foo-more-tables.pdf', pages='1-2')

#Start with the tool output number 1
output_number = 1

#Loop through the tables and output all of them
#note: the Python tool has a limited number of output anchors,
#so this suits PDFs with only a few tables
for table in tables:
    #Get the dataframe from the PDF table data
    df = table.df
    #print(df)
    #Write the dataframe with tabular data to the current tool output
    Alteryx.write(df, output_number)
    output_number += 1
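
The loop above handles one table at a time. For a document with many tables (such as the 87-page PDF mentioned earlier in the thread), a possible variation, which is not the code from this post, is to stack every table into one dataframe so that a single output anchor is enough. This is only a sketch: `combine_tables` and the `table_number` column are hypothetical names, and it assumes pandas is available (camelot's `table.df` is already a pandas dataframe).

```python
import pandas as pd


def combine_tables(frames):
    """Stack dataframes vertically, tagging each row with its 1-based table number."""
    tagged = []
    for table_number, frame in enumerate(frames, start=1):
        # assign() returns a copy, so the original dataframes stay untouched
        tagged.append(frame.assign(table_number=table_number))
    return pd.concat(tagged, ignore_index=True)


# Hypothetical usage inside the Python tool:
# combined = combine_tables([table.df for table in tables])
# Alteryx.write(combined, 1)
```

Tables with different column counts are still stacked; pandas fills the columns a table does not have with missing values, so pages with a different schema may need filtering afterwards.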

 

This gets you something like this:

[Screenshot: image.png]

from a PDF like this:

[Screenshot: image.png]

 

 

David Matyas
Sales Engineer
Alteryx
Meteoroid

And if you just want your program to run through all pages without knowing the last page number, you can replace the last page number with 'end':

 

tables = camelot.read_pdf('foo-more-tables.pdf', pages='1-end')
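
The `pages` argument is a string of comma-separated page numbers and ranges (for example '1,3-5' or '1-end'), and 'all' is also accepted. For a long document it may be convenient to read a batch of pages at a time. A small sketch of building such range strings follows; `chunked_page_specs` is a hypothetical helper, and it assumes the total page count is already known:

```python
def chunked_page_specs(total_pages, chunk_size):
    """Build page range strings such as '1-30' that together cover pages 1..total_pages."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    specs = []
    for first in range(1, total_pages + 1, chunk_size):
        last = min(first + chunk_size - 1, total_pages)
        # A batch holding a single page is written as just that page number
        specs.append(str(first) if first == last else f"{first}-{last}")
    return specs


# Hypothetical usage with camelot:
# for spec in chunked_page_specs(87, 30):
#     tables = camelot.read_pdf('foo-more-tables.pdf', pages=spec)
```

Reading in batches keeps each call smaller, at the cost of having to collect the tables from every batch yourself.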

 

 
