Why is Parsing PDFs Difficult?
Parsing unstructured data such as PDFs is generally challenging. Structured data is easy to access by specifying column names and row numbers, much like a grid on graph paper. In contrast, unstructured data is like a blank sketchbook, where identifying data relies heavily on geometric positioning.
How to Read PDFs in Alteryx Designer
When you need to import a PDF into Alteryx Designer, the first option you might consider is the Intelligence Suite add-on tools. This article introduces examples of how to use these tools. Alternatively, you can read a PDF file with the Python tool. In this blog, I will introduce a PDF parsing method that uses the Python tool and the pdfminer.six library.
Reading Data from PDFs with the Python Tool Using pdfminer.six
I won't explain the pdfminer.six library in detail, as it is not the main topic of this article. The key point is that it lets you extract not only text data from a PDF but also the coordinates of that text. Please note that pdfminer.six does not support OCR, so the PDF must already contain machine-readable text.
For example, let's try reading a PDF of a typical resume like this:
Resume
Running the pdfminer.six library within the Python tool extracts several attributes of each text element, as shown in the figure below. Each piece of text in the PDF is recognized as a rectangular object, and the coordinates of its four vertices are captured as X0, X1, Y0, and Y1. The Angle represents the text orientation: 0° indicates horizontal text, while 90°/-90° indicates vertical text.
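As a rough illustration of this kind of extraction, here is a minimal sketch, not the macro's actual code. It assumes pdfminer.six is installed; the function names and the output field names (`Page`, `Text`, `X0`, `Y0`, `X1`, `Y1`, `Angle`) are hypothetical, and the angle is derived from the first character's text matrix, which is only one possible way to estimate orientation.

```python
import math


def matrix_to_angle(matrix):
    """Return the rotation in degrees implied by a PDF text matrix (a, b, c, d, e, f)."""
    a, b = matrix[0], matrix[1]
    return round(math.degrees(math.atan2(b, a)))


def to_record(page_no, text, bbox, angle=0):
    """Flatten one text element into a row (field names are assumptions)."""
    x0, y0, x1, y1 = bbox  # pdfminer order: left, bottom, right, top
    return {"Page": page_no, "Text": text.strip(),
            "X0": x0, "Y0": y0, "X1": x1, "Y1": y1, "Angle": angle}


def extract_text_records(pdf_path):
    """Yield one record per text line in the PDF (requires pdfminer.six)."""
    # Imported lazily so the helpers above work without the library.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_no, page in enumerate(extract_pages(pdf_path), start=1):
        for box in page:
            if not isinstance(box, LTTextContainer):
                continue  # skip figures, lines, rectangles, etc.
            for line in box:
                if not isinstance(line, LTTextLine):
                    continue
                # Use the first real character to estimate the line's orientation.
                first = next((c for c in line if isinstance(c, LTChar)), None)
                angle = matrix_to_angle(first.matrix) if first else 0
                yield to_record(page_no, line.get_text(), line.bbox, angle)
```

Calling `list(extract_text_records("resume.pdf"))` (the file name is a placeholder) would give a list of dictionaries that can be loaded into a DataFrame and passed out of the Python tool.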
Coordinate information is critically important when parsing PDF files. For example, text elements with the same X0 value can be treated as vertically aligned (the same column), and those with the same Y0 value as horizontally aligned (the same row), which may help in reconstructing tables.
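The alignment idea can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes each text element is a dictionary with `Text`, `X0`, and `Y0` keys (hypothetical names), and the `tolerance` parameter is an assumption added because real Y0 values rarely match exactly.

```python
def group_into_rows(records, tolerance=2.0):
    """Group text records whose Y0 values are within `tolerance` into table rows."""
    rows = []
    # PDF Y coordinates grow upward, so sort descending to read top to bottom.
    for rec in sorted(records, key=lambda r: -r["Y0"]):
        if rows and abs(rows[-1][0]["Y0"] - rec["Y0"]) <= tolerance:
            rows[-1].append(rec)  # same baseline as the current row
        else:
            rows.append([rec])    # start a new row
    # Within each row, order the cells left to right by X0.
    return [[r["Text"] for r in sorted(row, key=lambda r: r["X0"])] for row in rows]


cells = [
    {"Text": "Alice", "X0": 100, "Y0": 700.2},
    {"Text": "Name", "X0": 50, "Y0": 700.0},
    {"Text": "DOB", "X0": 50, "Y0": 680.0},
    {"Text": "1990-01-01", "X0": 100, "Y0": 680.0},
]
print(group_into_rows(cells))
# [['Name', 'Alice'], ['DOB', '1990-01-01']]
```

The same grouping applied to X0 instead of Y0 would recover columns rather than rows.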
Creating Spatial Objects from Coordinate Data
Here is a more advanced example. Although we were able to extract text data as described earlier, it’s still difficult to determine which records are the target data we need. To address this, spatial analysis can be utilized. This approach is similar to the concept behind the Image Template Tool. In this method, you need to pre-mark the areas containing the data you want to extract on a template PDF. Then, by matching this template with the target PDF, you can extract the target data.
Creating the Template
First, create a template by marking the areas that contain the data you want to extract. Assuming that the same format repeats across multiple pages in the same PDF file, let's use the first page as the template:
- Place annotation text boxes to cover the areas where you want to extract data on the first page.
- Enter the data labels ('Name', 'DOB', 'Gender', etc.) inside each text box.
Text and Annotation Objects in PDFs
In a PDF, text objects and annotation objects (like the ones added in the previous step) are recognized as different types of objects. The macro shown here automates extracting these objects separately and retrieving their text and coordinate information.
- T Anchor: Text objects → text data and rectangular spatial objects on all pages
- A Anchor: Annotation objects → annotation text and rectangular spatial objects in the template
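For readers curious how annotation objects might be read in Python, here is a hedged sketch. It is not taken from the macro; it assumes pdfminer.six is installed, that the label text is stored in each annotation's `Contents` entry, and that its position is stored in `Rect`. The function names are hypothetical.

```python
def normalize_rect(rect):
    """Return (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1 from a PDF Rect array."""
    xa, ya, xb, yb = (float(v) for v in rect)
    return (min(xa, xb), min(ya, yb), max(xa, xb), max(ya, yb))


def read_annotations(pdf_path, page_no=1):
    """Return [(label, rect)] for annotations on one page (requires pdfminer.six)."""
    # Imported lazily so normalize_rect works without the library.
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdftypes import resolve1
    from pdfminer.utils import decode_text

    results = []
    with open(pdf_path, "rb") as fh:
        for number, page in enumerate(PDFPage.get_pages(fh), start=1):
            if number != page_no:
                continue  # only the template page is of interest
            for ref in resolve1(page.annots) or []:
                annot = resolve1(ref)
                contents = resolve1(annot.get("Contents"))
                rect = resolve1(annot.get("Rect"))
                if contents is None or rect is None:
                    continue  # skip annotations without a label or position
                label = decode_text(contents) if isinstance(contents, bytes) else str(contents)
                results.append((label.strip(), normalize_rect(rect)))
    return results
```

With the first page used as the template, `read_annotations(path, page_no=1)` would return pairs such as a label and its rectangle, in the same coordinate units as the text boxes.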
Next, the spatial objects from the T Anchor and A Anchor are matched using the Spatial Match tool. This lets you extract the text that lies within the annotation areas on the template. See the configuration window for a visual view of how the target text is captured.
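Conceptually, the matching step is a rectangle containment test. The sketch below is a plain-Python stand-in for that idea, not the Spatial Match tool itself. It assumes text elements are dictionaries with `Page`, `Text`, `X0`, `Y0`, `X1`, `Y1` keys and annotations are `(label, (x0, y0, x1, y1))` pairs; all of these names are hypothetical, and full containment is an assumed match rule.

```python
def contains(outer, inner):
    """Return True if rectangle `inner` lies entirely within rectangle `outer`."""
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


def match_text_to_labels(text_records, annotations):
    """Attach the template label to every text element inside a labelled area."""
    matches = []
    for rec in text_records:
        box = (rec["X0"], rec["Y0"], rec["X1"], rec["Y1"])
        for label, rect in annotations:
            if contains(rect, box):
                matches.append({"Page": rec["Page"], "Label": label, "Text": rec["Text"]})
    return matches


template = [("Name", (40, 690, 200, 720)), ("DOB", (40, 660, 200, 688))]
texts = [
    {"Page": 1, "Text": "Alice", "X0": 50, "Y0": 700, "X1": 90, "Y1": 712},
    {"Page": 2, "Text": "Bob", "X0": 50, "Y0": 700, "X1": 80, "Y1": 712},
    {"Page": 2, "Text": "Footer", "X0": 50, "Y0": 20, "X1": 90, "Y1": 32},
]
print(match_text_to_labels(texts, template))
```

Because the template rectangles are compared against text on every page, the same labelled area picks up "Alice" on page 1 and "Bob" on page 2, while text outside any area is dropped.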
For example, grouping by page and using the Crosstab tool can transform the data into a structured format like the table below.
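The pivot itself is simple to express in Python as well. This is a sketch of the idea rather than the Crosstab tool's behavior; it assumes the matched rows are dictionaries with `Page`, `Label`, and `Text` keys (hypothetical names) and joins multiple text elements in the same area with a space, which is an assumed choice.

```python
def crosstab(matches):
    """Pivot matched rows into one record per page, with one column per label."""
    table = {}
    for m in matches:
        row = table.setdefault(m["Page"], {})
        if m["Label"] in row:
            # Several text elements fell in the same labelled area: join them.
            row[m["Label"]] = row[m["Label"]] + " " + m["Text"]
        else:
            row[m["Label"]] = m["Text"]
    return [{"Page": page, **cols} for page, cols in sorted(table.items())]


matched = [
    {"Page": 1, "Label": "Name", "Text": "Alice"},
    {"Page": 1, "Label": "DOB", "Text": "1990-01-01"},
    {"Page": 2, "Label": "Name", "Text": "Bob"},
    {"Page": 2, "Label": "Name", "Text": "Smith"},
]
print(crosstab(matched))
# [{'Page': 1, 'Name': 'Alice', 'DOB': '1990-01-01'}, {'Page': 2, 'Name': 'Bob Smith'}]
```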
Notes on Using the Spatial PDF Macro
The workflow (WF) introduced here is attached in YXZP format. When you use the macro 'Spatial PDF.yxmc' in the workflow for the first time on your computer, you’ll need to install the required Python libraries. Please follow these steps:
- Launch Alteryx Designer with administrator privileges.
  - Right-click the Alteryx icon → Select Run as administrator
- Extract the .yxzp and open the workflow (WF).
  - In the Spatial PDF configuration screen, select 'Library Install Mode', then run the workflow.
If no errors occur after execution, the library installation was successful.
If an error appears, check the following:
- Is the Python library installation command (pip) being blocked? Office networks may block it via proxy settings, so consult your IT administrator if this is the case.
- Confirm that Alteryx Designer is running with administrator privileges.
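If you are unsure whether the installation succeeded, a quick check from a Python tool can help. The snippet below is a generic sketch using only the standard library; it assumes the library is importable under the module name `pdfminer`, and the helper name `is_installed` is hypothetical.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the named top-level module can be found by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


# Show which Python interpreter is in use, then whether pdfminer is visible to it.
print("Interpreter:", sys.executable)
print("pdfminer available:", is_installed("pdfminer"))
```

If the second line reports False, the install step did not reach the interpreter that the Python tool is using.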
Conclusion
This blog introduced a method for parsing PDF data using the Python tool and spatial tools. This approach is effective for extracting data that appears in the same location on every page of a PDF file. Additionally, even without spatial objects, coordinate information can be applied in various ways, such as reconstructing tables.
There is no universal 'best' solution for PDF parsing. I encourage you to explore a wide range of options, including the features of the Intelligence Suite, to determine what works best for your needs.
Notes on the Workflow Shared in This Blog:
- The WF shared in this blog is intended for technical demonstration only, and its functionality is not guaranteed. Please use and modify it at your own risk.
- I am unable to provide technical support for errors or issues, and I do not plan to update it in response to feature requests.