I am looking for a way to input a PDF that was scanned as an image into Alteryx. I cannot seem to find anything that will do the job.
The reason is that we get handwritten order forms faxed (yes, faxed) to us, which are then converted into a PDF. I want to take that PDF image, upload it to either Azure's OCR or Google's, then parse the result back so we can get something useful. We have tested Azure's OCR on handwritten text and it works great.
That is my plan anyway. If there is an easier way to do this, or if this won't work, please let me know.
I have made progress with this using Python, but I am stuck on something that is hopefully just me being stupid.
So far I have managed to get something static working using pdf2image, Pillow and Poppler.
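I then have the code below:
    from pdf2image import convert_from_path
    from pdf2image.exceptions import (
        PDFInfoNotInstalledError,
        PDFPageCountError,
        PDFSyntaxError,
    )

    # Raw strings keep the backslashes in the Windows paths literal.
    pages = convert_from_path(r'M:\pdftest.pdf')
    for page in pages:
        # Every page is saved to the same file, so only the last page is kept.
        page.save(r'M:\out2.jpg', 'JPEG')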
This takes the PDF from M:\pdftest.pdf and exports the JPEG M:\out2.jpg.
What I now want to do is use a Text Input tool to tell it where to find the PDF and where to write the output. Can anyone please help? I have played around with various things and I just cannot get it to work.
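One possible way to feed the paths in from a Text Input tool is sketched below. It is only a sketch under assumptions: the column names PdfPath and OutDir and the helper names page_output_paths, convert_pdf and run_in_alteryx are made up for illustration, and the Text Input tool is assumed to be wired to the Python tool's first input connection ("#1"). Alteryx.read and Alteryx.write are the calls the Python tool provides for reading and writing connections.

```python
from pathlib import PureWindowsPath


def page_output_paths(pdf_path, out_dir, page_count):
    """Build one JPEG path per page so pages do not overwrite each other."""
    stem = PureWindowsPath(pdf_path).stem
    return [
        str(PureWindowsPath(out_dir) / f"{stem}_p{i:03d}.jpg")
        for i in range(1, page_count + 1)
    ]


def convert_pdf(pdf_path, out_dir):
    """Convert every page of one PDF to a JPEG and return the files written."""
    # Imported here so the path helper can be used without pdf2image installed.
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path)
    targets = page_output_paths(pdf_path, out_dir, len(pages))
    for page, target in zip(pages, targets):
        page.save(target, "JPEG")
    return targets


def run_in_alteryx():
    """Read paths from input #1 and write the JPEG paths to output 1."""
    # These imports only resolve inside the Alteryx Python tool.
    import pandas as pd
    from ayx import Alteryx

    df = Alteryx.read("#1")  # assumed columns: PdfPath, OutDir (hypothetical)
    written = []
    for row in df.to_dict("records"):
        written.extend(convert_pdf(row["PdfPath"], row["OutDir"]))
    Alteryx.write(pd.DataFrame({"JpegPath": written}), 1)
```

Calling run_in_alteryx() in the Python tool's cell would then convert each row's PDF, so the same workflow could later take its rows from a Directory tool instead of a Text Input tool.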
Was going to try something similar this weekend and post my results but nice work there! :)
I would recommend importing the file directly in Python first (worry about making it more dynamic later).
I don't know if this method will work, but have you tried bringing the file in with the Directory tool? Pass the file path to the Python tool as a variable; Python will then look for the file and read it.
Thanks for your suggestion on this. My idea was to get it running with an input tool first; then I can easily turn it into a macro and feed it from a Directory tool, as I expect it might not like having a few hundred PDFs thrown at it every hour.
I'll do some more digging on this in my spare time and let you know my solution if I have any breakthrough.
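For the batch concern, a rough sketch of one option is to convert files one at a time and record a status per file, so a single bad PDF does not stop the run. The names convert_batch and convert_one and the result field names are hypothetical; convert_one stands for whatever function converts a single PDF and returns the list of files it wrote.

```python
def convert_batch(pdf_paths, convert_one):
    """Run convert_one on each path and collect a status row per file."""
    results = []
    for path in pdf_paths:
        try:
            outputs = convert_one(path)
            results.append(
                {"PdfPath": path, "Status": "ok", "Pages": len(outputs), "Error": ""}
            )
        except Exception as exc:
            # pdf2image raises errors such as PDFPageCountError or PDFSyntaxError
            # for unreadable files; record the message and carry on with the rest.
            results.append(
                {"PdfPath": path, "Status": "failed", "Pages": 0, "Error": str(exc)}
            )
    return results
```

The returned rows could be turned into a table and passed downstream, which makes it easy to filter out the failures for a manual check.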