Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.
gawa
16 - Nebula

Why is Parsing PDFs Difficult?

 

Parsing unstructured data such as PDFs is generally challenging. With structured data, values are easy to access by column name and row number, much like a grid on graph paper. In contrast, unstructured data is like a blank sketchbook, where locating a value relies heavily on its geometric position on the page.

 

How to Read PDFs in Alteryx Designer

 

When you need to import a PDF into Alteryx Designer, the first option you might consider is the Intelligence Suite add-on tools. This article introduces examples of how to use these tools. Alternatively, you can read a PDF file with the Python tool. In this blog, I will introduce a PDF parsing method that uses the Python tool and the pdfminer.six library.

 

Reading Data from PDFs with the Python Tool Using pdfminer.six

 

I won't explain the pdfminer.six library in detail, as it is not the main topic of this article. The key point is that this library lets you extract not only the text in a PDF but also the coordinates of each piece of text. Please note that pdfminer.six does not provide OCR, so the PDF must already contain machine-readable text (a scanned image of a page will not work).

 

For example, let's try reading a PDF of a typical resume like this:

 

Resume

 

By using the pdfminer.six library within the Python tool, several attributes of each text element are extracted, as shown in the figure below. Each piece of text in the PDF is recognized as a rectangular object, and the coordinates of its edges are captured as X0, X1, Y0, and Y1, from which the (X, Y) positions of the four vertices follow. The Angle represents the text orientation: 0° indicates horizontal text, while 90°/-90° indicates vertical text.
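As a rough illustration of what such an extraction step might look like, the sketch below walks the layout tree produced by pdfminer.six and collects each text line with its bounding box. It is not the code inside the macro: the function names `extract_text_lines` and `angle_from_matrix` are hypothetical, and deriving the angle from the first character's transformation matrix is my assumption about how such a value could be computed.

```python
import math


def angle_from_matrix(matrix):
    """Return the rotation, in whole degrees, implied by a text matrix (a, b, c, d, e, f)."""
    a, b = matrix[0], matrix[1]
    return round(math.degrees(math.atan2(b, a)))


def extract_text_lines(pdf_path):
    """Collect page number, text, bounding box, and angle for each text line (hypothetical helper)."""
    # Imported here so the pure helper above can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    rows = []
    for page_no, page in enumerate(extract_pages(pdf_path), start=1):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue  # skip images, curves, and other non-text objects
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                first_char = next((c for c in line if isinstance(c, LTChar)), None)
                if first_char is None:
                    continue  # a line with no glyphs carries no orientation
                x0, y0, x1, y1 = line.bbox
                rows.append({
                    "page": page_no,
                    "text": line.get_text().strip(),
                    "x0": x0, "y0": y0, "x1": x1, "y1": y1,
                    "angle": angle_from_matrix(first_char.matrix),
                })
    return rows
```

Calling `extract_text_lines` with the path to a text-based PDF returns a list of dictionaries, one per text line, which can be converted to a table for the downstream tools.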

 

Coordinate information is of the utmost importance when parsing PDF files. For example, text elements with the same X0 (or Y0) value can be deemed vertically (or horizontally) aligned, which may help in reconstructing tables.
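To make the alignment idea concrete, here is a minimal sketch that groups text elements into table rows when their Y0 values are nearly equal, then orders each row by X0. The function name `group_into_rows`, the dictionary keys, and the 2-point tolerance are all assumptions chosen for illustration; real documents may need a different tolerance. It also assumes the usual PDF convention that the origin is at the bottom-left of the page, so a larger Y0 is higher on the page.

```python
def group_into_rows(elements, tolerance=2.0):
    """Group elements whose y0 values lie within `tolerance` of each other into rows."""
    rows = []
    # Larger y0 is higher on the page, so sort descending to read top to bottom.
    for el in sorted(elements, key=lambda e: -e["y0"]):
        if rows and abs(rows[-1][0]["y0"] - el["y0"]) <= tolerance:
            rows[-1].append(el)  # same baseline as the current row
        else:
            rows.append([el])    # start a new row
    # Within each row, read left to right.
    return [[e["text"] for e in sorted(row, key=lambda e: e["x0"])] for row in rows]


sample = [
    {"text": "Name", "x0": 50, "y0": 700.0},
    {"text": "Alice", "x0": 150, "y0": 700.4},
    {"text": "DOB", "x0": 50, "y0": 680.0},
    {"text": "1990-01-01", "x0": 150, "y0": 680.2},
]
table = group_into_rows(sample)
```

The same approach applied to X0 instead of Y0 would group elements into columns.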

 

image.png

 

Creating Spatial Objects from Coordinate Data

 

Here is a more advanced example. Although we were able to extract text data as described earlier, it’s still difficult to determine which records are the target data we need. To address this, spatial analysis can be utilized. This approach is similar to the concept behind the Image Template Tool. In this method, you need to pre-mark the areas containing the data you want to extract on a template PDF. Then, by matching this template with the target PDF, you can extract the target data.

 

Creating the Template

 

First, create a template by marking the areas that contain the data you want to extract. Assuming that the same format repeats across multiple pages in the same PDF file, let's use the first page as the template:

  1. Place annotation text boxes to cover the areas where you want to extract data on the first page.
  2. Enter the data labels ('Name', 'DOB', 'Gender', etc.) inside each text box.

 

image.png

 

Text and Annotation Objects in PDFs

 

In a PDF, text objects and annotation objects (like the ones added in the previous step) are recognized as different types of objects. The macro shown here automates the process of extracting these two types separately and retrieving their text and coordinate information.
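For readers curious how annotations could be read separately from page text, the following is a sketch using pdfminer.six's lower-level page API. It is not the macro's actual code. It assumes that the text boxes were saved as `FreeText` annotations (the subtype may differ depending on the PDF editor), and the helper names `normalize_rect` and `extract_annotations` are hypothetical.

```python
def normalize_rect(rect):
    """Return (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1 from a 4-number PDF /Rect."""
    xa, ya, xb, yb = (float(v) for v in rect)
    return (min(xa, xb), min(ya, yb), max(xa, xb), max(ya, yb))


def extract_annotations(pdf_path, page_limit=1):
    """Read label text and rectangle of text-box annotations on the first `page_limit` pages."""
    # Imported here so normalize_rect can be used without pdfminer.six installed.
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdftypes import resolve1
    from pdfminer.psparser import LIT
    from pdfminer.utils import decode_text

    results = []
    with open(pdf_path, "rb") as fp:
        for page_no, page in enumerate(PDFPage.get_pages(fp), start=1):
            if page_no > page_limit:
                break  # only the template page(s) carry the markers
            for ref in resolve1(page.annots) or []:
                annot = resolve1(ref)
                # Assumption: the markers are FreeText annotations.
                if annot.get("Subtype") is not LIT("FreeText"):
                    continue
                contents = resolve1(annot.get("Contents", b""))
                label = decode_text(contents) if isinstance(contents, bytes) else str(contents)
                results.append({
                    "page": page_no,
                    "label": label.strip(),
                    "bbox": normalize_rect(resolve1(annot["Rect"])),
                })
    return results
```

With `page_limit=1`, only the first page is scanned, which matches the idea of using the first page as the template.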

 

image.png

 

image.png

 

  • T Anchor: Text objects → text data and rectangular spatial objects on all pages
  • A Anchor: Annotation objects → annotation text and rectangular spatial objects on the template page

 

Next, the spatial objects from the T Anchor and A Anchor are matched using the Spatial Match tool. This lets you extract the text that lies within the area of each annotation object on the template. The configuration window shows visually how the target text is captured.
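The matching step is done with the Spatial Match tool in the workflow, but the underlying idea can be sketched in plain Python as a rectangle containment test. The sketch below assumes a "fully within" rule, which may not be the exact match type configured in the macro, and the function names and sample coordinates are made up for illustration.

```python
def contains(outer, inner):
    """True if rectangle `inner` lies entirely within rectangle `outer` (x0, y0, x1, y1)."""
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


def match_text_to_labels(annotations, texts):
    """Attach the template label to every text element that falls inside a marked area."""
    matches = []
    for t in texts:
        for a in annotations:
            # Template areas are applied to every page, so the page is taken from the text.
            if contains(a["bbox"], t["bbox"]):
                matches.append({"page": t["page"], "label": a["label"], "text": t["text"]})
    return matches


annotations = [{"label": "Name", "bbox": (100, 690, 300, 720)}]
texts = [
    {"page": 1, "text": "Alice", "bbox": (110, 700, 150, 712)},
    {"page": 2, "text": "Bob", "bbox": (110, 700, 140, 712)},
    {"page": 2, "text": "Page 2", "bbox": (500, 20, 540, 32)},
]
matched = match_text_to_labels(annotations, texts)
```

Text outside every marked area, such as the page footer in the sample, is simply dropped.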

 

image.png

 

For example, grouping by page and using the Crosstab tool can transform the data into a structured format like the table below.
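The reshaping performed by the Crosstab tool can be illustrated with a small Python sketch: one output row per page, one column per label. The function name `crosstab_by_page` and the input layout are assumptions, and joining multiple text fragments with a space is one possible choice rather than what the workflow necessarily does.

```python
def crosstab_by_page(matches):
    """Pivot label/text pairs into one dictionary per page, with labels as keys."""
    table = {}
    for m in matches:
        row = table.setdefault(m["page"], {})
        if m["label"] in row:
            # Several text lines in the same marked area: join them with a space.
            row[m["label"]] = row[m["label"]] + " " + m["text"]
        else:
            row[m["label"]] = m["text"]
    return [{"page": page, **table[page]} for page in sorted(table)]


matches = [
    {"page": 1, "label": "Name", "text": "Alice"},
    {"page": 1, "label": "Gender", "text": "F"},
    {"page": 2, "label": "Name", "text": "Bob"},
    {"page": 2, "label": "Gender", "text": "M"},
]
structured = crosstab_by_page(matches)
```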

 

image.png

 

Notes on Using the Spatial PDF Macro

 

The workflow (WF) introduced here is attached as a YXZP file. When you use the macro 'Spatial PDF.yxmc' in the workflow for the first time on your computer, you’ll need to install the required Python libraries. Please follow these steps:

  1. Launch Alteryx Designer with Administrator Privileges.

    • Right-click the Alteryx icon → Select Run as administrator
  2. Extract the .yxzp and open the workflow (WF).

    • In the Spatial PDF configuration screen, select 'Library Install Mode', then run the workflow.

image.png

 

If no errors occur after execution, the library installation was successful.
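If you would like an additional check, a snippet along these lines could be run inside a Python tool to see whether the library can be found. This is only a sketch and not part of the shared workflow; the helper name `is_importable` is hypothetical, and it relies on the fact that pdfminer.six is imported under the module name `pdfminer`.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module with this name can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


# pdfminer.six installs a package that is imported as "pdfminer".
print("pdfminer available:", is_importable("pdfminer"))
```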

 

If an error appears, check the following:

  • Is the Python library installation command (pip) being blocked? Office networks may block it via proxy settings, so consult your IT administrator if this is the case.
  • Confirm once more that Alteryx Designer is running with administrator privileges.

 

Conclusion

 

This blog introduced a method for parsing PDF data using the Python tool and the spatial tools. This approach is effective when the data you want to extract appears in the same location on every page of a PDF file. Even without spatial objects, coordinate information can be applied in various ways, such as reconstructing tables.

 

PDF parsing has no universal 'best' solution. I encourage you to explore a wide range of options, including the features of the Intelligence Suite, to determine which approach works best for your needs.

 

Notes on the Workflow Shared in This Blog:

  • The WF shared in this blog is intended for technical demonstration purposes, and its functionality is not guaranteed. Please use and modify it at your own risk.
  • I am unable to provide technical support for errors or issues, and I do not intend to update the WF in response to feature requests.
Comments
BS_THE_ANALYST
14 - Magnetar

Oh, very cool @gawa! What an interesting way to think about parsing pdfs. This was great read 👏.

clmc9601
13 - Pulsar

This is awesome! Thanks, @gawa!