Hi all
I've built a number of workflows using the various PDF tools available in Alteryx to rename files for upload to internal servers.
Most of these workflows apply to between 20 and 2,000 docs. They read the PDF, find the relevant ID, and then rename the file with a predefined structure using the ID.
I've just been asked to apply one of these builds to a folder with 9,000 documents. It's taking forever to parse the data, as I would expect. I would like to cut this down as much as possible, so I have 2 questions:
1- Is there a way to tell the PDF tool to read just the first line of data on each document (where the identifier is located) and move on to the next document? I've tried the PDF Input tool and the image reader tools and I can't see a way to do this, but I thought I would ask.
2- If not, can you recommend a tool that will do the heavy lifting of scraping data from the PDFs as quickly as possible?
Any help would be greatly appreciated
Dave
Hey @Dave
To answer your questions:
1) Yes, you can specify a region of a PDF that you would like to read by using the "Image Template" tool within the Computer Vision tool palette.
2) If you are processing system-generated PDFs (not scanned copies of documents that are images), Alteryx recently released a new PDF to Text tool that greatly increases the accuracy and speed of extraction. Including the link to this new tool here for your reference: PDF to Text | Alteryx Help
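If you would rather script it (for example from the Python tool, or outside Alteryx entirely), here is a rough sketch of the same idea: extract text from page 1 only, take the first non-empty line, pull the ID out, and rename. It only works for PDFs with a text layer, not scans. Everything specific in it is an assumption on my part: the third-party pypdf library, the "ID-12345" identifier format, and the "DOC_" naming structure are placeholders to swap for your own.

```python
import re
from pathlib import Path

# Hypothetical ID format ("ID-" followed by digits); adjust to your real identifier.
ID_PATTERN = re.compile(r"\bID-(\d+)\b")


def first_line_of_first_page(pdf_path):
    """Extract text from page 1 only and return its first non-empty line."""
    from pypdf import PdfReader  # third-party, assumed installed: pip install pypdf

    reader = PdfReader(str(pdf_path))
    text = reader.pages[0].extract_text() or ""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


def new_name_for(first_line, prefix="DOC"):
    """Build the target file name from the ID, or return None if no ID is found."""
    match = ID_PATTERN.search(first_line)
    if match is None:
        return None
    return f"{prefix}_{match.group(1)}.pdf"


def rename_folder(folder, dry_run=True):
    """Rename every PDF in folder; with dry_run=True, only return the plan."""
    plan = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        target = new_name_for(first_line_of_first_page(pdf_path))
        if target is None:
            continue  # no recognizable ID: leave the file untouched
        destination = pdf_path.with_name(target)
        if destination.exists():
            continue  # never overwrite an existing file
        plan.append((pdf_path.name, target))
        if not dry_run:
            pdf_path.rename(destination)
    return plan
```

Calling rename_folder on your folder with the default dry_run=True returns the list of (old name, new name) pairs without touching anything, so you can check the plan on a small sample before running it on all 9,000 files.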
Hope this helps!
That's very interesting - I'll give the new tool a whirl, thank you