Parse data from multiple PDF files

Dave
8 - Asteroid

Hi all

I've built a number of workflows using the various PDF tools available in Alteryx to rename files for upload to internal servers.

Most of these workflows apply to between 20 and 2,000 docs. They read the PDF, find the relevant ID, and then rename the file with a predefined structure using the ID.
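For anyone following along outside Alteryx, the find-the-ID-and-rename step described above could be sketched roughly like this in Python. This is only an illustration, not the actual workflow: the ID format (`ID-` followed by digits), the `DOC_` prefix, and all function names are assumptions made up for the example, and it presumes the first line of text has already been extracted from each PDF by some other tool.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: the identifier looks like "ID-" followed by at least 4 digits.
# Adjust the pattern to match the real documents.
ID_PATTERN = re.compile(r"\bID-(\d{4,})\b")


def extract_id(first_line: str) -> Optional[str]:
    """Return the digits of the first identifier found in the line, or None."""
    match = ID_PATTERN.search(first_line)
    return match.group(1) if match else None


def build_new_name(original: Path, doc_id: str, prefix: str = "DOC") -> Path:
    """Build the new path in the same folder, keeping the original extension."""
    return original.with_name(f"{prefix}_{doc_id}{original.suffix}")


def plan_renames(first_lines: dict) -> dict:
    """Map each original path to its new path; files with no ID are skipped."""
    plan = {}
    for path, line in first_lines.items():
        doc_id = extract_id(line)
        if doc_id is not None:
            plan[path] = build_new_name(path, doc_id)
    return plan
```

Building the rename plan first, rather than renaming inside the loop, makes it possible to check for files with no ID or clashing target names before touching anything on disk. Applying the plan would then be a call to `old.rename(new)` for each pair.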

 

I've just been asked to apply one of these builds to a folder with 9,000 documents. As I would expect, it's taking forever to parse the data, and I would like to cut this down as much as possible. So I have two questions:

1- Is there a way to tell the PDF tool to read only the first line of data in each document (where the identifier is located) and then move on to the next document? I've tried the PDF input tool and the image reader tools and I can't see a way to do this, but I thought I would ask.

2- If not, can you recommend a tool that will do the heavy lifting of scraping data from the PDFs as quickly as possible?

Any help would be greatly appreciated

Dave

2 REPLIES
gautiergodard
13 - Pulsar

Hey @Dave 

To answer your questions:

1) Yes, you can specify the region of a PDF that you would like to read by using the "Image Template" tool within the Computer Vision tool palette.

2) If you are processing system-generated PDFs (not scanned copies of documents that are images), Alteryx recently released a new PDF to Text tool that greatly increases the accuracy and speed of extraction. Including the link to this new tool here for your reference: PDF to Text | Alteryx Help

 

Hope this helps!

Dave
8 - Asteroid

That's very interesting - I'll give the new tool a whirl, thank you
