
Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Input PDFs (they are images)

Asteroid

Hello all!

 

I am looking for a way to input a PDF that was scanned as an image into Alteryx. I cannot seem to find anything that will do the job.

 

The purpose of this is that we get handwritten order forms faxed (yes, faxed) to us, which are then converted into PDFs. I want to take that PDF image, upload it to either Azure's or Google's OCR, then parse the result back so we can get something useful. We have tested Azure's OCR on handwriting and it works great.

 

This is my plan anyway. If there is an easier way to do this, or if this won't work, please let me know.

 

Thank you!

 

Cheers

 

Chris

Alteryx Partner

Hi @clant,

 

As far as I know, Designer has no tool available to read PDFs. The only solution I can come up with is this one:

 

https://community.alteryx.com/t5/Alteryx-Knowledge-Base/Can-Alteryx-Parse-A-Word-Doc-Or-PDF/ta-p/115...

 

Maybe Azure has some way to automate loading and reading those files.

 

Good luck with it :)

Asteroid

Hi,

 

I have made progress with this using Python, but I am stuck on something that is hopefully just me being stupid.

 

So far I have managed to get something static working using pdf2image, Pillow and Poppler. 

 

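I then have the code below:

from pdf2image import convert_from_path
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError,
)

# Raw strings keep the backslashes in the Windows paths literal.
pages = convert_from_path(r'M:\pdftest.pdf')
# Every page is saved to the same file name, so a multi-page PDF keeps only its last page.
for page in pages:
    page.save(r'M:\out2.jpg', 'JPEG')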

 
This takes the PDF from M:\pdftest.pdf and exports the JPEG M:\out2.jpg.
What I now want to do is use a Text Input tool to tell it where to find the PDF, and to declare where I want the output written. Can anyone please help? I have played around with various things and I just cannot get it to work.
 
Thanks
 
Chris
 
 
 
Alteryx Partner

Was going to try something similar this weekend and post my results but nice work there! :)

 

I would recommend importing the file directly from Python first (worry about making it more dynamic later).

 

I don't know if this method will work, but have you tried inputting the file with the Directory tool? Pass the file path to the Python tool as a variable; Python will then look for the file and read it.
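
A rough sketch of the "file path as a variable" idea, in case it helps. It assumes the Python tool receives one row per PDF (from a Text Input or Directory tool) and that the rows are read with Alteryx.read("#1"), which returns a pandas DataFrame. The column names pdf_path and out_dir and the helper pdf_to_jpegs are made up for illustration; they are not part of pdf2image or Alteryx. The converter is passed in as a parameter so the helper can be tried without Poppler installed.

```python
from pathlib import Path


def pdf_to_jpegs(pdf_path, out_dir, convert=None):
    """Render each page of pdf_path to <out_dir>/<pdf stem>_p<N>.jpg.

    Returns the written file paths as strings.
    """
    if convert is None:
        # Imported lazily so the helper can be exercised without Poppler.
        from pdf2image import convert_from_path as convert
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    written = []
    # Number the files so pages of a multi-page PDF do not overwrite each other.
    for number, page in enumerate(convert(str(pdf_path)), start=1):
        target = out_dir / f"{stem}_p{number}.jpg"
        page.save(str(target), "JPEG")
        written.append(str(target))
    return written


# Assumed wiring inside the Alteryx Python tool (column names are hypothetical):
# from ayx import Alteryx
# import pandas as pd
# rows = Alteryx.read("#1")  # one row per PDF
# paths = [p for r in rows.itertuples() for p in pdf_to_jpegs(r.pdf_path, r.out_dir)]
# Alteryx.write(pd.DataFrame({"jpeg_path": paths}), 1)
```

Returning the JPEG paths on the output anchor would let a later step in the workflow pick the images up for the OCR upload.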

Alteryx Partner

Here is another approach: use the Run Command tool to extract the images directly into a folder, then upload them to your cloud later.

 

The link below gives the complete explanation of how the method works; you just have to insert the .bat file into your workflow and give it the input :)

 

https://www.experts-exchange.com/videos/215/Xpdf-PDFimages-Extract-Images-from-PDF-Files.html

 

Cheers!
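
If you would rather stay in the Python tool than use a .bat file, the same pdfimages call could be sketched as below. This assumes the pdfimages executable (from Xpdf or Poppler) is on the PATH; the function names are made up for illustration. The -j option asks pdfimages to write JPEG-encoded images as .jpg files, and the second argument is the prefix used for the output file names.

```python
import subprocess


def build_pdfimages_command(pdf_path, image_root, exe="pdfimages"):
    """Return the argument list for: pdfimages -j <pdf> <image-root>."""
    # -j writes DCT (JPEG) images as .jpg; other image types come out as PBM/PPM.
    return [exe, "-j", str(pdf_path), str(image_root)]


def extract_images(pdf_path, image_root, exe="pdfimages"):
    """Run pdfimages; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_pdfimages_command(pdf_path, image_root, exe), check=True)
```

For example, extract_images(r"M:\pdftest.pdf", r"M:\out\page") would write the extracted images into M:\out with file names starting with "page".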

Asteroid

Hi @afv2688 

 

Thanks for your suggestion on this. My idea was to get it running with an input tool; then I can easily turn it into a macro and feed it from a Directory tool, as I expect it might not like having a few hundred PDFs thrown at it every hour.

 

I'll do some more digging on this in my spare time and let you know my solution if I have a breakthrough.

 

cheers

 

chris
