TheOC
15 - Aurora

What I want to show you today is how easy it is to combine Alteryx Intelligence Suite with a couple of macros to automate the whole PDF ingestion process. Don't worry, I've built the macros for you, and they're ready to drag and drop into your workflow!

 

First things first, let me quickly explain how PDF reading works in Alteryx Intelligence Suite. The most common way to ingest a PDF document is through a 'template,' which allows you to specify the exact information you want to extract from your documents by using a single document as a template and dragging a selection over the elements you are interested in.
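To make the template idea concrete, here is a minimal Python sketch of what a template boils down to: a set of named rectangles, each of which collects whatever text falls inside it. This is not how Alteryx stores or applies templates internally. The `Region` class, the `apply_template` function, the coordinates, and the sample words are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A named rectangle on the page (hypothetical page coordinates)."""
    name: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def apply_template(regions, words):
    """Collect the words that fall inside each region.

    `words` is a list of (text, x, y) tuples standing in for text that has
    already been recognised on the page. Returns one value per region.
    """
    row = {}
    for region in regions:
        inside = [text for text, x, y in words if region.contains(x, y)]
        row[region.name] = " ".join(inside)
    return row


# Hypothetical template with two selections drawn over an invoice.
template = [
    Region("invoice_number", left=400, top=40, right=580, bottom=70),
    Region("total", left=400, top=700, right=580, bottom=730),
]

# Hypothetical recognised words with their positions on the page.
words = [("INV-001", 420, 55), ("Total:", 300, 715), ("99.50", 450, 715)]

result = apply_template(template, words)
print(result)
```

Each selection becomes one named field, so a document with the same layout yields the same fields every time. Text outside every selection (the "Total:" label here) is simply ignored.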

 

TheOC_0-1672917779163.png

 

The Image Template tool, shown above, works exactly this way. I have selected a PDF document to use as a template, and I can begin to select parts of the document:

 

TheOC_1-1672917803861.png

 

I repeat this process across the same document to build up my template:

 

TheOC_2-1672917823052.png


Finally, I combine the template with the Image to Text tool, which now outputs the following columns for each PDF:

 

TheOC_3-1672917839652.png

 

One feature I love about this is that you can import and export these templates, saving a bank of different templates related to the types of PDF documents you receive, and switching between them for your different processes.

 

TheOC_4-1672917855843.png

 

This functionality got me thinking...

 

Automating the whole process

 

I love that I can import and export templates, but why stop there? We can automate the whole process, and I've built the tool for it! You can download my PDF Template Macro here. It lets you specify the location of your PDF documents and the location of the templates you exported from the Image Template tool. It then uses batch and iterative macros to loop through both, matching your PDF documents against all known templates.

 

TheOC_0-1674611709011.png

 

This completely removes the need for duplicate tools on a canvas or navigating between multiple templates to manually investigate which works best. Why not try them all automatically?

 

The macro has been built with speed and automation in mind. This has been achieved in two ways:

  • Once a PDF has successfully fit a template, it is removed from the process, so it is not processed again on later iterations. Each iteration therefore handles fewer documents than the last, which speeds up the tool considerably.
  • The tool will also output results that do not match a template. This allows for manual intervention to be requested where needed, for instance, if a new template is required or a specific PDF document needs investigation.
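If it helps to see that logic outside of Alteryx, here is a rough Python sketch of the same idea. It is not the macro itself, just an illustration under some assumptions: `match_pdfs` is a name I made up, and `fits` is a hypothetical function standing in for "this template successfully extracted data from this PDF".

```python
def match_pdfs(pdfs, templates, fits):
    """Try each template in turn against the PDFs that are still unmatched.

    `fits(pdf, template)` is a hypothetical callable that returns True when
    the template works for the PDF.
    Returns (matched, unmatched), where matched maps each PDF to its template.
    """
    matched = {}
    remaining = list(pdfs)
    for template in templates:  # one iteration per template
        still_unmatched = []
        for pdf in remaining:
            if fits(pdf, template):
                matched[pdf] = template  # removed from later iterations
            else:
                still_unmatched.append(pdf)
        remaining = still_unmatched
        if not remaining:  # nothing left to match, stop early
            break
    return matched, remaining


# Toy stand-in for a real fit check: a template "fits" a PDF whose file name
# starts with the template name. Every call is recorded so we can count them.
calls = []


def fits(pdf, template):
    calls.append((pdf, template))
    return pdf.startswith(template)


pdfs = ["invoice_01.pdf", "invoice_02.pdf", "receipt_01.pdf", "unknown.pdf"]
templates = ["invoice", "receipt"]

matched, unmatched = match_pdfs(pdfs, templates, fits)
print(matched)
print(unmatched)
print(len(calls))
```

In this toy run the first pass checks all four PDFs, but the second pass only checks the two that are left, so there are six checks in total instead of eight. The PDF that fits no template ends up in `unmatched`, ready for someone to take a look at.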

 

What are you waiting for? Download the Alteryx Intelligence Suite free trial here and my macro here.

 

The PDF to Text tool is here! This tool enables customers to extract data directly from the PDF binary, giving users the fastest and most accurate way to pull data off system-generated PDFs. Read this in-depth blog by Alteryx Data Scientist Emily Van Ark as she goes through how this tool works.

Comments
atcodedog05
22 - Nova

This is great 😁. And I got this at the right time 😃.

 

 Thank you @TheOC 😎

mahadevaswab
8 - Asteroid

Dear @TheOC ,

 

This is a good starting point for extracting data from PDFs.

 

However, we have noticed that if a PDF is slightly skewed or has been scanned upside down, it is not read properly.

 

Is there a machine learning option, or any other solution, for reading those cases?

 

Thanks & Regards

Mahadev

BS_THE_ANALYST
14 - Magnetar

This is awesome! I can't wait to try it.