What I want to show you today is how easy it is to combine Alteryx Intelligence Suite with a couple of macros to automate the whole PDF ingestion process. Don't worry, I've built the macros for you, and they're ready to drag and drop into your workflow!
First things first, let me quickly explain how PDF reading works in Alteryx Intelligence Suite. The most common way to ingest a PDF document is through a 'template,' which allows you to specify the exact information you want to extract from your documents by using a single document as a template and dragging a selection over the elements you are interested in.
The Image Template tool is built around this idea. Here I have selected a PDF document to use as a template, and I can begin selecting the parts of the document I am interested in.
I repeat this process across the same document to build up my template.
Finally, I combine it with the Image to Text tool, which now outputs a set of columns for each PDF.
One feature I love about this is that you can import and export these templates, saving a bank of different templates related to the types of PDF documents you receive, and switching between them for your different processes.
This functionality got me thinking...
Automating the whole process
I love that I can import and export templates, but why stop there? We can automate the whole process, and I've built the tool for it! You can download my PDF Template Macro here. It lets you specify the location of your PDF documents and the location of the templates you exported from the Image Template tool. It then uses batch and iterative macros to loop through them, matching each PDF document against every known template.
This completely removes the need for duplicate tools on a canvas or navigating between multiple templates to manually investigate which works best. Why not try them all automatically?
The macro has been built with speed and automation in mind. This has been achieved in two ways:
Once a PDF has successfully matched a template, it is removed from the process, so it is not processed again on later iterations. This avoids repeated work and greatly improves the performance of the tool.
The tool will also output results that do not match a template. This allows for manual intervention to be requested where needed, for instance, if a new template is required or a specific PDF document needs investigation.
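If it helps to see the looping logic outside of Alteryx, here is a minimal Python sketch of the same idea. It is not the macro itself: `match_pdfs_to_templates`, `toy_matcher`, and the file names are hypothetical, and the matcher is a stand-in for whatever "does this template fit this PDF?" check you use. It only illustrates the two behaviours described above: matched PDFs drop out of later iterations, and anything left over is returned for manual review.

```python
from typing import Callable, Dict, List, Optional, Tuple

Extraction = Dict[str, str]
# Hypothetical matcher signature: returns the extracted fields,
# or None when the template does not fit the PDF.
Matcher = Callable[[str, str], Optional[Extraction]]


def match_pdfs_to_templates(
    pdfs: List[str],
    templates: List[str],
    try_template: Matcher,
) -> Tuple[Dict[str, Tuple[str, Extraction]], List[str]]:
    """Try each template in turn against the PDFs that are still unmatched."""
    matched: Dict[str, Tuple[str, Extraction]] = {}
    remaining = list(pdfs)

    for template in templates:
        if not remaining:
            break  # every PDF already has a template, nothing left to try
        still_unmatched: List[str] = []
        for pdf in remaining:
            fields = try_template(pdf, template)
            if fields is not None:
                # Matched: record it and keep it out of later iterations.
                matched[pdf] = (template, fields)
            else:
                still_unmatched.append(pdf)
        remaining = still_unmatched

    # Whatever is left matched no template and needs manual attention.
    return matched, remaining


def toy_matcher(pdf: str, template: str) -> Optional[Extraction]:
    # Stand-in rule for illustration only: a template "fits" when the
    # PDF file name starts with the template name.
    if pdf.startswith(template):
        return {"matched_template": template}
    return None


if __name__ == "__main__":
    found, leftover = match_pdfs_to_templates(
        ["invoice_001.pdf", "receipt_17.pdf", "memo.pdf"],
        ["invoice", "receipt"],
        toy_matcher,
    )
    print(found)
    print(leftover)
```

In this toy run, the invoice PDF is matched on the first pass and is never tested against the receipt template, while the memo fits neither template and comes back in the leftover list.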
The PDF to Text tool is here! This tool enables customers to extract data directly from the PDF binary, giving users the fastest and most accurate way to pull data off system-generated PDFs. Read this in-depth blog by Alteryx Data Scientist Emily Van Ark as she goes through how this tool works.