Data Science

AlteryxMarco · ‎06-22-2022

by Joe Marco, Carli Edelstein (@CarliE), and Rachel Hatcherian (@rachelhatcherian)

Introduction

The Computer Vision tools in Alteryx Intelligence Suite offer a variety of options for scraping and extracting data from PDFs. However, there is a common need to pull out the values of checkboxes in PDFs (check marks, Xs, etc.). In this article, we share some of the options available to you, along with some cool ways to wrap several of the steps in a macro to streamline the process.

We will assume the role of Boba Fett for this exercise, who is looking to pull some information out of his tax form submissions using Alteryx Intelligence Suite.


Important Prerequisites

To leverage any of these options for PDF checkbox extraction, we need to ensure the PDFs are “flattened.” This means the final layers of the PDF have been merged together so that our image tools can properly detect values in checkboxes. Flattening is a standard step that can be done in Adobe or other PDF software.

The easiest way to tell whether your PDFs are flattened is to open a few and see whether you can edit them. If you can, they are likely not flattened and still have multiple layers where users can alter data, checkboxes, etc. If you open your PDFs and they are flattened, nothing should be editable.
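If you have many files to check, a rough script can flag the ones that still look fillable. The sketch below is a heuristic under stated assumptions, not a definitive test: it only looks for the /AcroForm marker that PDFs with interactive form fields normally carry, and it can miss that marker when a file stores its objects in compressed streams. The function name and the sample bytes are made up for illustration.

```python
def looks_fillable(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw PDF bytes mention an interactive form (/AcroForm)."""
    return b"/AcroForm" in pdf_bytes


# Hypothetical, hand-written fragments standing in for real file contents.
fillable = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog /AcroForm 5 0 R >>\nendobj\n"
flattened = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"

print(looks_fillable(fillable))   # True
print(looks_fillable(flattened))  # False
```

For a real file you would pass in the result of reading it in binary mode. Treat a False result as "probably flattened" and still open a few files by hand to confirm.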

We recommend not using scanned PDFs with any of these options, as scanned PDFs can have issues with different colors, markings, etc., which could lead to inconsistent results. When annotating checkbox images, make sure to drag/annotate only inside the checkbox and do not include the border of the box itself.

Understanding the solution(s)

We will highlight two options here for pulling checkbox values from your own PDFs:

Measuring Dark Pixels with Computer Vision Tools (more tools used, but allows for easier threshold definition)
Measuring Byte Size with Computer Vision Tools (fewer tools used, but requires some threshold setting)

Both options can be wrapped into a macro so that you do not have to drag in many tools for each checkbox you want to interpret.

Solution option 1: measuring dark pixels with step-by-step imagery

General highlight

We discovered that when reading in PDFs & annotating checkboxes as images, the annotated images of checkboxes could be further leveraged with Alteryx tools (specifically Image Profile) to check if there are dark pixels in the box or not. In this solution, we will be using a variety of computer vision tools, with some formulas to pull checkbox values out of PDFs with other data and convert them into a workable dataset for an end-user. This should work regardless of whether there is a check, X, or another type of marking within a checkbox.

Finished workflow overview

The workflow package is attached at the bottom of this article.

Steps

Navigate to the Computer Vision palette and drag an Image Input tool onto the canvas. Point it at the folder where your PDFs are located. If you are working with a single PDF, you will need to put it into its own folder, as this tool needs a folder specification. An example of configuration can be seen below.

Navigate to the Computer Vision palette and drag in an Image Template tool. Here you will annotate the values in your PDF; you can annotate strings, images, or tables. We will only focus on image annotation here (that is, bringing in the checkbox as an image). As you can see, for checkboxes, we annotate inside the checkbox and make sure it is labeled as an “image” in the dropdown. Repeat this step for the checkboxes across all of your PDF templates.

Navigate to the Computer Vision palette and drag in an Image to Text tool. This tool takes the annotations from your PDF and writes them out as data/text for further use. You will notice that our images are output as a “byte” value. If you have multiple PDFs or multiple pages per PDF, you will see more results than what is shown in this example.

For the example we are working with, we have one PDF in our file path with two pages. To keep things simple, we will just focus on the first page of the PDF by applying a filter to the specific page with the checkboxes.

Navigate to the Computer Vision palette and drag in an Image Profile tool. Connect the output from your annotation section to the Image Profile tool. You must specify a single column in this tool to profile. In short, we use this tool to look at the image of each checkbox annotation and profile a few things about it. The specific thing we want is the total number of dark pixels in the image (that is, whether there is something in the box or not). Since the Image Profile tool adds many fields for each image/checkbox, add a Select tool after it and keep only the relevant fields, including the field with the dark pixel count. When finished, this part of your workflow should look something like the following:

As you can see in the example, our first checkbox (CB_Single) has 18 dark pixels. We can gauge from this that the checkbox has some markings in it, which leads us to believe it is checked. If we cross-check against the actual PDF, we see that it is indeed checked.
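To make the idea concrete, here is a minimal sketch of what "counting dark pixels" means. It is an illustration only and does not reproduce how the Image Profile tool works internally: the 4x4 grids, the function name, and the cutoff of 64 on a 0 to 255 grayscale are all assumptions.

```python
def count_dark_pixels(gray_rows, dark_below=64):
    """Count pixels whose grayscale value (0 = black, 255 = white) is below dark_below."""
    return sum(1 for row in gray_rows for value in row if value < dark_below)


# Hypothetical 4x4 crops of the inside of a checkbox.
empty_box = [[255] * 4 for _ in range(4)]
x_marked_box = [
    [0, 255, 255, 0],
    [255, 0, 0, 255],
    [255, 0, 0, 255],
    [0, 255, 255, 0],
]

print(count_dark_pixels(empty_box))     # 0
print(count_dark_pixels(x_marked_box))  # 8
```

This is also why the annotation should stay inside the box: if the crop included the border, even an empty checkbox would report dark pixels.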

Repeat this step for all the checkbox images you need to analyze. In the next section, we will show you how to wrap this in a macro, so you do not need to drag in 14+ tools to check each box.

Pull in a Join Multiple tool from the Join palette. At this point, all our checkboxes have run through Image Profile tools to verify whether they have dark pixels, but we need to clean up the naming in the outputs so that it is understandable to work with. Connect all the Image Profile/Select tools to the Join Multiple tool, and join on path, file, and page to keep it consistent. You only need to select the key fields, and in the configuration pane you will want to rename each dark pixel count to its respective checkbox. One way I did this can be seen in the below screenshot.
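If it helps to picture what this join does to the data, the sketch below merges the per-checkbox results into one wide row per path, file, and page, renaming the generic count to the checkbox it belongs to. The field name dark_pixels, the folder path, and the CB_Married checkbox are hypothetical; only CB_Single and its count of 18 come from the example above.

```python
def join_on_keys(named_results, keys=("path", "file", "page")):
    """Merge per-checkbox result rows that share the same key values into one wide row."""
    merged = {}
    for checkbox_name, rows in named_results.items():
        for row in rows:
            key = tuple(row[k] for k in keys)
            wide = merged.setdefault(key, dict(zip(keys, key)))
            # Rename the generic count column to the checkbox it belongs to.
            wide[checkbox_name] = row["dark_pixels"]
    return list(merged.values())


results = {
    "CB_Single": [{"path": "C:/forms", "file": "form.pdf", "page": 1, "dark_pixels": 18}],
    "CB_Married": [{"path": "C:/forms", "file": "form.pdf", "page": 1, "dark_pixels": 0}],
}
print(join_on_keys(results))
# [{'path': 'C:/forms', 'file': 'form.pdf', 'page': 1, 'CB_Single': 18, 'CB_Married': 0}]
```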

Additional steps after this point are more about assigning text labels to our numeric dark pixel counts and some joining/cleansing of the final output of the PDF scrape.

Bring in a multi-field formula tool to assign binary values to the dark pixel totals. The formula I used below simply states that if the dark pixels are greater than 0, then assign a “1”; if not then keep it 0.

IF [_CurrentField_] = 0 THEN 0
ELSEIF [_CurrentField_] > 0 THEN 1
ELSE [_CurrentField_]
ENDIF
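For readers who think more easily in a scripting language, the same branching logic can be sketched in Python. This is only a mirror of the formula above for illustration; the function name is made up.

```python
def to_flag(dark_pixels):
    """Mirror of the formula above: 0 stays 0, any positive count becomes 1."""
    if dark_pixels == 0:
        return 0
    if dark_pixels > 0:
        return 1
    return dark_pixels  # any other numeric value passes through unchanged


print([to_flag(n) for n in (0, 18, 3)])  # [0, 1, 1]
```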

Bring in another multi-field formula tool to assign text values to the 0 and 1s. I used checked or not checked as my labels, but you can use whatever wording you would like.
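As a rough picture of this labeling step, the sketch below swaps the 0 and 1 flags for text in every checkbox field of a row and leaves the other fields alone. The CB_ prefix check, the field names, and the label wording are assumptions chosen for the example.

```python
LABELS = {0: "Not Checked", 1: "Checked"}  # wording is up to you


def label_flags(row, prefix="CB_"):
    """Replace 0/1 flags with text labels in every field whose name starts with prefix."""
    return {
        name: LABELS.get(value, value) if name.startswith(prefix) else value
        for name, value in row.items()
    }


row = {"file": "form.pdf", "CB_Single": 1, "CB_Married": 0}
print(label_flags(row))
# {'file': 'form.pdf', 'CB_Single': 'Checked', 'CB_Married': 'Not Checked'}
```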

Results after this formula should look like the following:

In the last steps of the final container in the workflow, we simply bring our checkbox values back together with the other fields we pulled out of our PDF using the Intelligence Suite tools. Use a Join tool on “file” to join the fields not related to the checkboxes back in.

Finally, we perform some minor data cleansing as some of our PDF scrapes have some spaces, punctuation, etc. This will give you the final scraped data (including checkboxes) from our original PDF! We added a transpose tool to place the data vertically for an easy-to-see screenshot!
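The kind of cleansing meant here can be pictured with a small sketch that trims stray whitespace and punctuation from the ends of a scraped value. It is an assumption about what "minor cleansing" covers, and the sample string is made up.

```python
import string


def cleanse(value):
    """Trim whitespace and stray punctuation from both ends of a scraped string."""
    return value.strip(string.whitespace + string.punctuation)


print(cleanse("  Boba Fett. "))  # Boba Fett
```

Only the ends are trimmed, so punctuation inside a value (a hyphenated name, for example) is left alone.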

Boba Fett’s taxes are now in the system and Tatooine authorities are excited to have accurate data.


How can I automate this? With a macro!

You may be thinking, do I have to use an image profile tool and a select tool for every image field I annotated? The answer is no. By creating a macro, you can consolidate the process and apply these steps to the necessary image blob fields without repeating the same process. Let’s go through the steps of how we can do that!

Finished workflow (with macro) overview

The workflow package is attached at the bottom of this article.

Steps

Reference steps 1 – 4 in the above section. These steps need to be taken before moving forward to the macro.

Create a batch macro to replace step 6 from above.

a. Use a macro input to specify the template of the data going in.

b. Use the Image Profile tool and select a placeholder column.

c. Set the name of the placeholder column to be updated using a control parameter tool and an action tool.

d. Select necessary fields with a select tool.

e. Ensure the name of the field comes through by creating a formula and updating the placeholder of X with the field name.

f. Use a macro output tool after the formula tool to spit out results. Save the macro to be used in the workflow.

Before using the macro, we need the names of the fields that it will process. To get them, we can use a Dynamic Select tool to select the Blob fields created during the annotation step.

After dynamically selecting the appropriate fields, we can then use the field info tool to output the metadata of our fields.
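Conceptually, these two steps boil down to "give me the names of every field whose type is Blob." Here is a minimal sketch of that idea, assuming the metadata arrives as a list of name and type pairs; the sample field names and the string type are hypothetical.

```python
def blob_field_names(field_info):
    """Return the names of all fields whose type is Blob."""
    return [field["Name"] for field in field_info if field["Type"] == "Blob"]


# Hypothetical metadata, shaped like a list of name/type pairs.
field_info = [
    {"Name": "file", "Type": "V_WString"},
    {"Name": "CB_Single", "Type": "Blob"},
    {"Name": "CB_Married", "Type": "Blob"},
]
print(blob_field_names(field_info))  # ['CB_Single', 'CB_Married']
```

Selecting by type rather than by name means a newly annotated checkbox is picked up without editing the list by hand.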

Once we have this information, we can add the macro to the canvas. Right-click on the canvas, and select insert -> macro -> checkbox.

There are two input anchors for the batch macro:

1. Upside-down question mark: the input for the variables on the batch macro

2. D: data input

Feed in the output of the field info tool into the upside-down question mark input – we will be using the column called “Name” to batch through the different field names to replace the placeholder column from our macro with each field.

Feed in the data from the image to text tool into the D input anchor.

a. The output should then look like this:

Since the data is vertical, we need to make sure to pivot it (so that it's horizontally aligned) by using a Crosstab tool.
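To see why the macro output is vertical and what the pivot fixes, it can help to sketch the batch loop in code. The sketch below runs a stand-in profiling function once per field name, stacks the results, and then pivots them back to one row per file. Everything here is illustrative: the record layout, the stand-in count_dark function, and the tiny pixel lists are assumptions, not the macro's actual internals.

```python
def run_batch(rows, field_names, profile):
    """Profile each named field for every row, emitting one vertical record per pair."""
    return [
        {"file": row["file"], "field": name, "dark_pixels": profile(row[name])}
        for name in field_names
        for row in rows
    ]


def crosstab(records):
    """Pivot vertical records into one wide row per file."""
    wide = {}
    for rec in records:
        wide.setdefault(rec["file"], {"file": rec["file"]})[rec["field"]] = rec["dark_pixels"]
    return list(wide.values())


def count_dark(pixels):
    """Hypothetical stand-in for the profiling step: count grayscale values below 64."""
    return sum(1 for p in pixels if p < 64)


rows = [{"file": "form.pdf", "CB_Single": [0, 0, 255], "CB_Married": [255, 255, 255]}]
vertical = run_batch(rows, ["CB_Single", "CB_Married"], count_dark)
print(vertical)            # one record per checkbox field
print(crosstab(vertical))  # [{'file': 'form.pdf', 'CB_Single': 2, 'CB_Married': 0}]
```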

Reference steps 8-10 in the section above to finish up the process!

Solution option 2: measuring byte size to determine “checks” with step-by-step imagery

General highlight

Sometimes when using computer vision, the checkboxes come back as a “J.” Other times, they come back as another letter or symbol. An alternative way to extract checkboxes is to write a simple formula that checks how many bytes the image of that box takes up.

Here I have a 1040 form for Boba Fett and Obi-Wan, who are both filing as single (the struggle of being a bounty hunter and a Jedi is real). We know that extracting information from tax forms can be quite taxing. I want to extract the string fields from this PDF as well as the checkboxes.

Steps

Use the Image Input to read in multiple PDF forms. We will also need our Image Template to annotate our PDF. Both of these are fed into Image to Text to translate the string and images for us.

We annotate our PDF in Image Template. For my string fields, I will use the dropdown of “String”. It is important to note that for my checkboxes I am annotating them as “Image”. This is because we will need the byte size of that image!

Tip: When naming your checkboxes, prefix them with “CB” to make it easier to convert later in this workflow.

Looking at the Image to Text output, we can see that the images of checked boxes are larger than 90 bytes. Knowing this, we can convert these images to a numeric byte size and add a formula that marks a box as checked when its size is greater than 90 bytes.
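The cutoff of 90 comes from eyeballing this particular form, so your own PDFs will likely need a different value. One possible way to pick it, sketched below, is to take the midpoint between the largest unchecked size and the smallest checked size from a few sample forms. The function name and the sample sizes are hypothetical, and this is just one reasonable rule of thumb.

```python
def pick_threshold(unchecked_sizes, checked_sizes):
    """Return a cutoff halfway between the largest unchecked and smallest checked size."""
    largest_unchecked = max(unchecked_sizes)
    smallest_checked = min(checked_sizes)
    if largest_unchecked >= smallest_checked:
        raise ValueError("sizes overlap; byte size alone cannot separate these boxes")
    return (largest_unchecked + smallest_checked) / 2


# Hypothetical byte sizes read off a handful of sample forms.
print(pick_threshold([78, 80, 82], [98, 104, 120]))  # 90.0
```

If the two groups overlap, the function raises an error rather than guessing, which is a hint that the dark pixel approach may suit those forms better.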

Before we perform this calculation, we need to convert our checkbox fields. Looking at Select, we can see that all our checkboxes are of the Blob data type.

Our third step is to convert these Blobs to numeric, get those byte sizes, and write out our formula. To make it easy to convert all these Blobs in one go, we first transpose our checkboxes.

Remember how we added a “CB” prefix to all those checkboxes? This is where you are going to be really happy you did!

In Transpose, select the file as your key column. Then under Data Columns, you can search for CB to quickly find all of your checkbox fields.
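For a mental model of what the transpose produces, the sketch below turns every CB-prefixed column into its own record with a Name and a Value, keyed by file. The row contents (file name, FirstName field, and the tiny byte strings) are invented for the example.

```python
def transpose(rows, key="file", prefix="CB"):
    """Turn each prefixed column into its own (key, Name, Value) record."""
    return [
        {key: row[key], "Name": name, "Value": value}
        for row in rows
        for name, value in row.items()
        if name.startswith(prefix)
    ]


rows = [{"file": "fett.pdf", "FirstName": "Boba", "CB_Single": b"\x01\x02", "CB_Married": b"\x01"}]
for record in transpose(rows):
    print(record)
```

With every checkbox blob now sitting in a single Value column, one conversion and one formula cover all of them at once.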

Once these are transposed, we can convert the blobs in the Value column. Using Blob Convert, use the below settings to convert the blob to binary data, where we can then do our calculation.

This updates our Value field to a long string, from which we can then calculate the byte size.

Using Formula, divide the length of Value by 2 to get the byte size (this assumes the converted string uses two characters per byte, as a hex encoding does). Since our value is a string, we wrap that in the tonumber() function. From there, we see if the byte size is greater than 90. If so, our box is checked!

IF tonumber(Length([Value])/2) > 90 THEN "Checked" ELSE "Unchecked" ENDIF
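The same arithmetic can be sketched in Python to show where the division by 2 comes from. This assumes the converted string is a hex encoding, in which every byte becomes two characters; the function name and the blob sizes of 120 and 80 bytes are made up for illustration.

```python
def checkbox_label(hex_value, threshold=90):
    """Label a checkbox from a hex-encoded blob string (two characters per byte)."""
    byte_size = len(hex_value) / 2
    return "Checked" if byte_size > threshold else "Unchecked"


# Hypothetical blobs: 120 bytes for a marked box, 80 bytes for an empty one.
marked = bytes(120).hex()
empty = bytes(80).hex()

print(len(marked), checkbox_label(marked))  # 240 Checked
print(len(empty), checkbox_label(empty))    # 160 Unchecked
```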

Now we’re in business! I’ve got a good feeling about this.

My fields are exactly how I want them, but I need to get them back in the right direction. Using Cross Tab I can quickly flip my data back to its original format by row.

Finally, we join it back together with our original string fields to have our completed dataset with strings and checkboxes.

Final Thoughts

What excites us most about this overview is the creative way we were able to leverage the Computer Vision capabilities within Intelligence Suite to efficiently pull out information from documents where checkbox values have been difficult to consistently extract in the past. Boba Fett can walk away with confidence that he has pulled the correct information using Intelligence Suite, and will not be chased down… at least not by anyone who is responsible for filing his tax forms!

Star Wars & Tax Season: Using Alteryx Intelligence Suite to Pull Checkbox Values From PDFs
