
C3PO
Alteryx Alumni (Retired)

Filling out long, dense forms can be a chore; copying and pasting from them is an even bigger one. Let’s see how Designer’s Computer Vision tools can help.

 

Below is a K-1 tax form that Peter Pan has been instructed to copy text from:

k1.png

 

That is a lot of text and Peter has more than one of these to review! He knows he needs to automate extraction of this data so he can get to the important work of analyzing this company’s financial processes to make sure they’re in compliance with tax laws.

 

But, oh my word (pun intended), that form has a lot of fields… and look at how dense section J is!

 

j.png

 

Never fear: all Peter needs is a little faith and some computer vision pixie dust. Using a combination of the Image Input, Image Template, and Image to Text tools, we can automate extracting data from simple and complex forms! Let’s see how Peter can accomplish this for his K-1 tax forms.

 

First, we point the Image Input tool at the folder that contains our K-1 forms. Then, using the Image Template tool, we “annotate” our document; this is how we tell Designer which values we want to extract and which field names they should be mapped to.

C3PO_4-1647898379213.png

 

Note how our form has many fields per section letter or number (henceforth called an identifier), as well as many values per field. One way to address this is to build each annotation field name from a combination of the field name, identifier, and attribute.

 

C3PO_5-1647898379220.png

 

C3PO_6-1647898379303.png
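
The naming convention above can be sketched outside Designer as well. The snippet below is only an illustration: the `|` delimiter, the `annotation_name` helper, and the example field names are assumptions, not something taken from the workflow itself.

```python
# Hypothetical convention: field name, identifier, and attribute joined by a delimiter.
DELIMITER = "|"  # assumed; pick any character that never appears in the parts


def annotation_name(field: str, identifier: str, attribute: str) -> str:
    """Build one annotation name from its three parts."""
    parts = (field, identifier, attribute)
    for part in parts:
        if DELIMITER in part:
            raise ValueError(f"{part!r} contains the delimiter {DELIMITER!r}")
    return DELIMITER.join(parts)


# Illustrative names for section J; each combination stays unique.
names = [
    annotation_name("Profit", "J", "Beginning"),
    annotation_name("Profit", "J", "Ending"),
    annotation_name("Loss", "J", "Beginning"),
    annotation_name("Loss", "J", "Ending"),
]
```

Keeping the delimiter out of the parts themselves is what makes the names safe to split apart again later.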

 

Best Practice: When extracting data from images, a high resolution (300 DPI) is recommended.

 

Hot tip: You can export your annotations as a JSON file. You can use this to share your annotations with someone or perhaps to create annotation standards for other developers.

 

C3PO_7-1647898379337.png

 

Once we’re done annotating our form fields, we connect the Image Input and Image Template tools to the Image to Text tool and run our workflow to extract the text. The result is one row of data per annotation. From there, we use the Text to Columns tool to separate the field name, identifier, and field values.

 

C3PO_8-1647898379345.png
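
For readers who think in code, the split performed by Text to Columns might look roughly like this. It is a sketch under assumptions: the `annotation` and `text` column names, the `|` delimiter, and the sample values are all made up for illustration and are not the Image to Text tool's actual output schema.

```python
# Made-up rows standing in for "one row of data per annotation".
rows = [
    {"annotation": "Profit|J|Beginning", "text": " 10 "},
    {"annotation": "Profit|J|Ending", "text": "12"},
]


def split_annotation(row: dict, delimiter: str = "|") -> dict:
    """Split a composite annotation name into its three parts."""
    # maxsplit=2 yields exactly three pieces for a well-formed name.
    field, identifier, attribute = row["annotation"].split(delimiter, 2)
    return {
        "field": field,
        "identifier": identifier,
        "attribute": attribute,
        "value": row["text"].strip(),  # drop stray whitespace around the extracted text
    }


parsed = [split_annotation(row) for row in rows]
```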

And with a little faith in column and row positions and some Transpose tool pixie dust, we don’t have to walk the copy-and-paste plank. Instead, we can now use Designer to analyze the partner’s share of profit, loss, and capital for the beginning and ending periods.

 

C3PO_9-1647898379364.png

 

C3PO_10-1647898379377.png
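
One way to picture the reshaping step is as a pivot from one row per value to one row per field. This is only one possible reading of what the workflow does; the `pivot` helper and the sample numbers below are hypothetical.

```python
from collections import defaultdict

# Made-up section J values as (field, attribute, value) rows.
long_rows = [
    ("Profit", "Beginning", "10"),
    ("Profit", "Ending", "12"),
    ("Capital", "Beginning", "8"),
    ("Capital", "Ending", "9"),
]


def pivot(rows):
    """Collect each field's attribute/value pairs into a single record."""
    wide = defaultdict(dict)
    for field, attribute, value in rows:
        wide[field][attribute] = value
    return dict(wide)


wide_rows = pivot(long_rows)
```

After the pivot, each field carries its beginning and ending values side by side, which is the shape needed to compare the two periods.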

 

Now that Peter has all his form fields in a format he can analyze, it’s time to put them all together. The Image to Text tool also outputs standard file metadata fields, such as file name and path. With this file metadata, we can group rows from the same file together.

 

C3PO_11-1647898379379.png

 

C3PO_12-1647898379384.png
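
Grouping by file metadata can be sketched as follows. The `file_name` key is an assumed column name and the rows are invented; the point is only that every row from the same source image ends up in the same record.

```python
from collections import defaultdict

# Made-up rows; "file_name" is an assumed name for the metadata column.
rows = [
    {"file_name": "k1_a.png", "field": "Profit", "value": "10"},
    {"file_name": "k1_b.png", "field": "Profit", "value": "25"},
    {"file_name": "k1_a.png", "field": "Loss", "value": "10"},
]


def group_by_file(rows):
    """Gather all field/value pairs that came from the same file."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row["file_name"]][row["field"]] = row["value"]
    return dict(grouped)


by_file = group_by_file(rows)
```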

 

After grouping rows together, we union them into one data set, enforcing a consistent field schema with the Text Input, Join, and Union tools. And voilà! Peter can now analyze the data from his K-1 forms!

 

C3PO_13-1647898379386.png

 

C3PO_14-1647898379390.png

 

C3PO_15-1647898379405.png
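
The idea of unioning against a fixed schema can be illustrated with a short sketch. The `SCHEMA` column names and the `conform` and `union` helpers are hypothetical; in the workflow this role is played by the Text Input, Join, and Union tools.

```python
# Hypothetical target schema, playing the role of the Text Input template.
SCHEMA = ["file_name", "Profit_Beginning", "Profit_Ending", "Loss_Beginning"]


def conform(record: dict, schema=SCHEMA) -> dict:
    """Return the record with exactly the schema's columns, filling gaps with None."""
    return {column: record.get(column) for column in schema}


def union(batches):
    """Stack records from several batches once each matches the schema."""
    return [conform(record) for batch in batches for record in batch]


# Two made-up batches with different columns present.
section_j = [{"file_name": "k1_a.png", "Profit_Beginning": "10", "Profit_Ending": "12"}]
section_other = [{"file_name": "k1_b.png", "Loss_Beginning": "3", "Unmapped": "x"}]

combined = union([section_j, section_other])
```

Because every record is forced into the same columns, missing values show up as explicit gaps and unexpected columns are dropped rather than silently widening the output.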

 

With Designer, Peter can complete his work much faster, which means more time for shenanigans with the Lost Boys!



Want to try the Alteryx Intelligence Suite for yourself?

Click here to download a free trial of the Alteryx Intelligence Suite.

 

Get the workflow in this blog here.

Comments
tdukes_bcbs

Very well laid out and detailed. Thank you for uploading this. I do notice some garbage coming in when I run this.

 

Record 41 Ordinary business income (loss) value per pdf: 43496; per Alteryx workflow: "/43496"

 

There are other lines with trash in the data.

 

Could this be due to DPI settings in the PDF?

 

Thank you again for this post.

 

Thomas Dukes