Alteryx Designer Desktop Discussions


Help with converting PDF with images to Excel

OlgaK
5 - Atom

Hi!

I am having trouble using the PDF Input macro/tool to convert a PDF to Excel. I am not sure whether the problem is caused by the presence of images, but the result I get looks like the screenshot below. Has anyone had an issue like this before? Is there a workaround? Thank you!

PDF Input.JPG

3 REPLIES
PeterA
Alteryx Alumni (Retired)

You might want to try out tabula-py. I have had a lot of luck reading in PDFs with this Python library.

Here is some sample code for your Python tool. It takes in a field containing the full path to the PDF and passes it to the Python tool, which reads in and parses the file.

 

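# Import Package alongside Alteryx so that installPackages is available
from ayx import Alteryx, Package
Package.installPackages('tabula-py')
from tabula import read_pdf
# Incoming connection #1 carries a FullPath field pointing at the PDF
pdf_document = Alteryx.read("#1")
FullPath = pdf_document['FullPath'].iloc[0]
# Parse the PDF and send the result to output anchor 1
parsedPDF = read_pdf(FullPath)
Alteryx.write(parsedPDF,1)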

And if you want to get fancy, you can specify the bounds of the table and avoid the image altogether.

The area format is [top, left, bottom, right], with each distance measured in points from the upper-left corner of the page.

 

parsedPDF = read_pdf(FullPath, area=[[100,50,400,400]])
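
If you measured the table as an upper-left corner plus a width and height, a small conversion step gives the [top, left, bottom, right] list that the area argument expects. The sketch below is only an illustration: area_from_box and inches_to_points are hypothetical helper names, not part of tabula-py, and the numbers are made up to match the example above.

```python
POINTS_PER_INCH = 72  # PDF page coordinates use 72 points per inch


def inches_to_points(inches):
    """Convert a measurement in inches to PDF points."""
    return inches * POINTS_PER_INCH


def area_from_box(left, top, width, height):
    """Turn an upper-left corner plus size into [top, left, bottom, right]."""
    return [top, left, top + height, left + width]


# Hypothetical region: 50 pt from the left edge, 100 pt from the top,
# 350 pt wide and 300 pt tall.
area = area_from_box(left=50, top=100, width=350, height=300)
print(area)  # [100, 50, 400, 400]
```

The resulting list can be passed as the area argument to read_pdf.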

 

OlgaK
5 - Atom

Thank you!

NeilFisk
9 - Comet

I didn't have as much luck using your code.  I had the following issues:

 

'pages' argument isn't specified.Will extract only from page 1 by default.

 

ERROR: Alteryx.write(pandas_df, outgoing_connection_number):
Currently only pandas dataframes can be used to pass data to outgoing connections in Alteryx

It would be highly unfortunate if other Python packages can't be used to write data out from Alteryx. I can use different code to output to a CSV file, but that defeats the purpose of doing this within Alteryx. Any ideas?

 

Regards,

Neil 
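
One possible explanation for the write error above, offered as an assumption rather than a confirmed diagnosis: depending on the tabula-py version and options, read_pdf can return a list of DataFrames instead of a single DataFrame, while Alteryx.write accepts only a DataFrame. The sketch below uses a hypothetical helper, first_table (not part of tabula-py or ayx), to pick one table from either kind of result. The 'pages' message is a warning rather than an error, and passing pages explicitly should silence it.

```python
def first_table(result):
    """Return a single table from whatever read_pdf handed back.

    Assumption: result is either one DataFrame or a list of DataFrames.
    """
    if result is None:
        raise ValueError("read_pdf returned nothing")
    if isinstance(result, (list, tuple)):
        if len(result) == 0:
            raise ValueError("no tables were detected in the PDF")
        # Keep only the first detected table
        return result[0]
    # Already a single table, so pass it through unchanged
    return result


# Inside the Python tool this might be used roughly as follows (not run here):
#   tables = read_pdf(FullPath, pages=1)
#   Alteryx.write(first_table(tables), 1)
```

If the PDF holds several tables, each list element could instead be written to its own output anchor.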
