Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Extracting Tabular Data from PDF Documents with Python Code Tool

8 - Asteroid

@DavidM 

 

Thank you.

 

It didn't work exactly as I envisaged. I ran tests with different page ranges from my PDF, but in every case only the last page was read into Alteryx.

 

I really am grateful for your time.

 

PythonVoice.JPG

 

8 - Asteroid

@NBart 

 

Thanks for the tip.

Alteryx

Hi @tochy,

 

Off the top of my head, I can't think of a better way than iterating over the tables and writing each one to a different output of the Python tool:

 

 

from ayx import Alteryx
import camelot

# Specify the path to your PDF document.
# The pages parameter is needed to go beyond page 1.
tables = camelot.read_pdf('foo-more-tables.pdf', pages='1-2')

# Write each table to its own output anchor, numbered from 1.
# Note: the Python tool has only five output anchors.
for output_number, table in enumerate(tables, start=1):
    df = table.df
    Alteryx.write(df, output_number)

 

 

This is of course not ideal, and it only works for smaller PDFs with a handful of tables.

 

If you want to do this across a huge PDF, though, you could replace the original script entirely with the following.
It takes the whole PDF and writes out one CSV per table.

Then you would just point an Input Data tool at those files, using * in the input path to read all of the CSVs into Alteryx.

 

 

from ayx import Alteryx
import camelot

#specify the path to your PDF document
#need to include param pages to go beyond page 1
tables = camelot.read_pdf('foo-more-tables.pdf', pages='1-end')  

tables.export('foo.csv', f='csv', compress=False) 

 

 

In my case (a PDF with 2 pages and 3 tables), this created:

foo-page-1-table-1.csv

foo-page-2-table-1.csv

foo-page-2-table-2.csv

 

Just write these to an empty folder and read them all in with an Input Data tool, using for instance "foo-page-*" in the path.
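
If you would rather stay inside Python than use the Input Data tool, the exported files can also be collected with the standard library. This is only a minimal sketch, assuming the files follow the foo-page-N-table-M.csv naming listed above; the helper name load_exported_tables and its folder and prefix arguments are made up for illustration.

```python
import csv
import re
from pathlib import Path

# Matches the names listed above, e.g. foo-page-2-table-1.csv
NAME_PATTERN = re.compile(r"-page-(\d+)-table-(\d+)\.csv$")


def load_exported_tables(folder, prefix="foo"):
    """Return a list of (page, table, rows) tuples sorted by page, then table.

    Hypothetical helper: `folder` and `prefix` are assumptions, so adjust
    them to wherever the CSVs were exported and how they were named.
    """
    results = []
    for path in Path(folder).glob(f"{prefix}-page-*-table-*.csv"):
        match = NAME_PATTERN.search(path.name)
        if match is None:
            continue  # skip files that only partly resemble the pattern
        page, table = int(match.group(1)), int(match.group(2))
        with path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle))
        results.append((page, table, rows))
    # Sort numerically so that page 10 comes after page 2
    results.sort(key=lambda item: (item[0], item[1]))
    return results
```

Each entry keeps the page and table number parsed from the file name, so you can still tell which table a row came from after everything is loaded.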

 


David Matyas
Sales Engineer
Alteryx
7 - Meteor

@DavidM I ran the same lines of code but still got an error:

"

NameError: name 'Alteryx' is not defined

"

 

As you suggested before, I installed everything properly (see the picture below), and I have full access to this folder.

My goal is to extract multiple tables and keep them in tabular form.

 

Alteryx

You are just missing the Alteryx package import:

#Need the Alteryx package
from ayx import Alteryx
David Matyas
Sales Engineer
Alteryx
7 - Meteor

@DavidM 

I tried the import and got this error:

ModuleNotFoundError: No module named 'ayx'

 

Alteryx

@esridhar126 I think this is still the same problem I posted about in a different thread, and it has nothing to do with PDF parsing.

It is caused by insufficient privileges for your Alteryx Designer installation, which means the Python packages cannot be loaded successfully.

Please try running Alteryx with elevated privileges.
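
To see which Python interpreter is actually running the code and whether it can find the packages, a quick check like the one below may help narrow things down. This is a generic standard-library sketch, not an Alteryx-specific diagnostic; describe_module is a made-up helper name.

```python
import importlib.util
import sys


def describe_module(name):
    """Report whether `name` can be found by the current interpreter."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        spec = None
    if spec is None:
        return f"{name}: NOT found"
    return f"{name}: found at {spec.origin}"


# The interpreter running this code; if it is not the one that ships with
# Designer, the ayx package will most likely not be visible to it.
print("Interpreter:", sys.executable)
for module_name in ("ayx", "camelot"):
    print(describe_module(module_name))
```

If ayx is reported as not found, the code is probably running in a different Python environment than the one the Python tool installs its packages into.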

David Matyas
Sales Engineer
Alteryx
7 - Meteor

What I did now was copy the ayx package and paste it into the conda packages folder.

Now, when I run the line above, I get the following error:

 

ModuleNotFoundError: No module named 'PyYXDBReader'
Alteryx

@esridhar126 could you please contact support@alteryx.com if elevating privileges does not work?

I found a guide on how to do that in Windows here:

https://www.dummies.com/computers/operating-systems/windows-7/how-to-run-a-program-with-elevated-per...

 

Cheers

David Matyas
Sales Engineer
Alteryx
6 - Meteoroid

Hi! This is a good contribution.

However, I have a problem with my PDF file. I tried to run your solution on it, but it shows an error like this:

IndexError: list index out of range

I have shared my file. Could you help, if possible?

 

Thanks
