Help with converting PDF with images to Excel

Question

Hi!

I am having trouble using the PDF Input macro/tool to convert a PDF to Excel. I am not sure whether the problem is caused by the presence of images, but the result I get looks like this. Has anyone had an issue like this before? Is there a workaround? Thank you!

NeilFisk · Answer

I didn't have as much luck using your code.  I had the following issues:

'pages' argument isn't specified.Will extract only from page 1 by default.

ERROR: Alteryx.write(pandas_df, outgoing_connection_number):
Currently only pandas dataframes can be used to pass data to outgoing connections in Alteryx

This would be highly unfortunate if other Python packages can't be used to write out from Alteryx.  I can use different code to output to a CSV file, but that defeats the purpose of doing this within Alteryx.  Any ideas?

Regards,

Neil

OlgaK · Answer

Thank you!

PeterA · Answer

You might want to try out tabula-py. I have had a lot of luck reading in PDFs with this Python library.

Here is some sample code for your Python Tool. It takes a field containing the full path to the PDF and passes it to the Python Tool, which reads in and parses the file:

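from ayx import Alteryx, Package
Package.installPackages('tabula-py')  # tabula-py also needs Java installed
import pandas as pd
from tabula import read_pdf
# Read the incoming record that holds the path to the PDF
pdf_document = Alteryx.read("#1")
FullPath = pdf_document['FullPath'].iloc[0]
# Assumes a tabula-py version where read_pdf returns a list of DataFrames
tables = read_pdf(FullPath, pages='all')
# Alteryx.write accepts only a single pandas DataFrame, so combine the tables
parsedPDF = pd.concat(tables, ignore_index=True)
Alteryx.write(parsedPDF, 1)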
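
The "only pandas dataframes" error reported earlier in the thread is consistent with read_pdf handing back a list of DataFrames (one per detected table) rather than a single DataFrame, which is what newer tabula-py versions do by default. A minimal sketch of one way to handle either case is below. The helper name combine_tables and the table_number column are made up for illustration, and the two small frames stand in for real read_pdf output.

```python
import pandas as pd


def combine_tables(tables):
    """Return one DataFrame from either a single DataFrame or a list of them."""
    if isinstance(tables, pd.DataFrame):
        # Older behaviour: already a single DataFrame, nothing to do
        return tables
    frames = []
    for number, table in enumerate(tables, start=1):
        tagged = table.copy()
        # Hypothetical column recording which extracted table a row came from
        tagged["table_number"] = number
        frames.append(tagged)
    if not frames:
        # No tables were detected in the PDF
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True, sort=False)


# Stand-in data in place of real read_pdf output
first = pd.DataFrame({"Item": ["A", "B"], "Qty": [1, 2]})
second = pd.DataFrame({"Item": ["C"], "Qty": [3]})
combined = combine_tables([first, second])
```

Inside the Python Tool, the result of combine_tables(read_pdf(FullPath, pages='all')) could then be passed to Alteryx.write in place of the raw read_pdf output.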

And if you want to get fancy, you can specify the bounds of the table and avoid the image altogether.

The area format is top, left, bottom, right: distances in points measured from the upper-left corner of the page.

parsedPDF = read_pdf(FullPath, area=[[100, 50, 400, 400]])
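
tabula-py documents the area argument as top, left, bottom, right, so a box measured as top, left, width, height needs converting first. Here is a small sketch of that conversion. The helper name to_tabula_area is hypothetical, and the numbers are chosen only to reproduce the area used above.

```python
def to_tabula_area(top, left, width, height):
    """Convert a top/left/width/height box (in points) to [top, left, bottom, right]."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # bottom is measured down from the top of the page, right from its left edge
    return [top, left, top + height, left + width]


# A box starting 100 pt down and 50 pt in, 350 pt wide and 300 pt tall
area = to_tabula_area(top=100, left=50, width=350, height=300)
print(area)  # [100, 50, 400, 400]
```

The resulting list can be passed as area=area to read_pdf.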