Data Science

C3PO · ‎03-22-2022

Filling out long and dense forms can be a chore, and it’s an even bigger chore to copy and paste from them. Let’s see how Designer’s Computer Vision tools can help.

Below is a K-1 tax form that Peter Pan has been instructed to copy text from:

That is a lot of text and Peter has more than one of these to review! He knows he needs to automate extraction of this data so he can get to the important work of analyzing this company’s financial processes to make sure they’re in compliance with tax laws.

But, oh my word (pun intended), that form has a lot of fields… and look at how dense section J is!

Never fear: all Peter needs is a little faith and some computer vision pixie dust. Using a combination of the Image Input, Image Template, and Image to Text tools, we can automate extracting data from simple and complex forms! Let’s see how Peter can accomplish this for his K-1 tax forms.

First, we point the Image Input tool at the folder that contains our K-1 forms. Then, using the Image Template tool, we “annotate” our document; this is how we tell Designer what values we want to extract and what field names they should be mapped to.

Note how our form has many fields per section letter or number (henceforth called an identifier), as well as many values per field. One way to address this is to build each annotation field name from a combination of the field name, identifier, and attribute.
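To make that naming scheme concrete, here is a minimal sketch in plain Python (not something Designer runs). The `|` delimiter, the `annotation_name` helper, and the sample names are all assumptions for illustration; the workflow itself may use a different separator.

```python
DELIMITER = "|"  # assumed separator; choose one that never appears in a form label


def annotation_name(field: str, identifier: str, attribute: str) -> str:
    """Join field name, identifier, and attribute into one annotation name."""
    parts = [part.strip() for part in (field, identifier, attribute)]
    for part in parts:
        # Reject empty parts and parts that would break a later split.
        if not part or DELIMITER in part:
            raise ValueError(f"invalid annotation part: {part!r}")
    return DELIMITER.join(parts)


# Hypothetical names for two of the partner's share fields in section J.
print(annotation_name("Profit", "J", "Beginning"))
print(annotation_name("Profit", "J", "Ending"))
```

Keeping the delimiter out of the individual parts is what makes the names safe to split apart again after extraction.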

Best Practice: When extracting data from images, use a high resolution (300 DPI is recommended).

Hot tip: You can export your annotations as a JSON file. You can use this to share your annotations with someone or perhaps to create annotation standards for other developers.

Once we’re done annotating our form fields, we connect the Image Input and Image Template tools to the Image to Text tool and run our workflow to extract the text. The result is one row of data per annotation. From there, using the Text to Columns tool, we separate the field name, identifier, and attribute.
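For readers who think in code, the split performed by Text to Columns looks roughly like the Python sketch below. It assumes annotation names shaped like `field|identifier|attribute`; the column names `annotation` and `text`, the `|` delimiter, and the sample values are hypothetical and are not the actual output schema of the Image to Text tool.

```python
def split_annotation(name: str, delimiter: str = "|") -> dict:
    """Split one composite annotation name into its three parts."""
    parts = name.split(delimiter)
    if len(parts) != 3:
        raise ValueError(f"expected 3 parts, got {len(parts)}: {name!r}")
    field, identifier, attribute = parts
    return {"field": field, "identifier": identifier, "attribute": attribute}


# One row per annotation; the values are made up for illustration.
rows = [
    {"annotation": "Profit|J|Beginning", "text": "10"},
    {"annotation": "Profit|J|Ending", "text": "12"},
]

parsed = [{**split_annotation(r["annotation"]), "text": r["text"]} for r in rows]
for record in parsed:
    print(record)
```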

And with a little faith in column and row positions and some Transpose tool pixie dust, we don’t have to walk the copy and paste plank. Instead, we can now use Designer to analyze the partner’s share of profit, loss, and capital for the beginning and ending periods.
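The reshaping step can be pictured as a pivot from one row per annotation to one row per field, with a column for each period. The sketch below is a plain Python analogy rather than a description of how the Transpose tool works internally; the `pivot` helper, the column names, and the sample values are assumptions.

```python
def pivot(rows):
    """Turn one-row-per-attribute records into one-row-per-field records."""
    wide = {}
    for row in rows:
        key = (row["field"], row["identifier"])
        # Each attribute (for example "Beginning") becomes its own column.
        wide.setdefault(key, {})[row["attribute"]] = row["text"]
    return [
        {"field": field, "identifier": identifier, **attrs}
        for (field, identifier), attrs in wide.items()
    ]


# Made-up values for illustration only.
long_rows = [
    {"field": "Profit", "identifier": "J", "attribute": "Beginning", "text": "10"},
    {"field": "Profit", "identifier": "J", "attribute": "Ending", "text": "12"},
    {"field": "Loss", "identifier": "J", "attribute": "Beginning", "text": "8"},
    {"field": "Loss", "identifier": "J", "attribute": "Ending", "text": "9"},
]

wide_rows = pivot(long_rows)
for record in wide_rows:
    print(record)
```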

Now that Peter has all his form fields in a format he can analyze, it’s time to put them all together. Recall that the Image to Text tool outputs standard file metadata fields like file name and path. With this file metadata we can group rows from the same file together.

After grouping rows together, we union them into one data set, using the Text Input tool to define a consistent field schema along with the Join and Union tools. And voilà! Peter can now analyze the data from his K-1 forms!
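As a rough analogy for the grouping and union steps, the Python sketch below groups rows by file name and then conforms every row to one fixed list of columns, filling gaps with `None`. The `SCHEMA` columns, the helper names, and the file names are hypothetical; in Designer the same idea is expressed with the Text Input, Join, and Union tools.

```python
SCHEMA = ["file_name", "field", "Beginning", "Ending"]  # assumed target columns


def group_by_file(rows):
    """Collect rows that came from the same source file."""
    groups = {}
    for row in rows:
        groups.setdefault(row["file_name"], []).append(row)
    return groups


def union_with_schema(groups, schema=SCHEMA):
    """Stack all groups into one list where every record has the same columns."""
    return [
        {col: row.get(col) for col in schema}  # missing columns become None
        for rows in groups.values()
        for row in rows
    ]


# Made-up rows; the second file is missing its "Ending" value.
rows = [
    {"file_name": "k1_a.png", "field": "Profit", "Beginning": "10", "Ending": "12"},
    {"file_name": "k1_b.png", "field": "Profit", "Beginning": "20"},
    {"file_name": "k1_a.png", "field": "Loss", "Beginning": "8", "Ending": "9"},
]

groups = group_by_file(rows)
combined = union_with_schema(groups)
for record in combined:
    print(record)
```

Forcing every record through the same column list is what keeps a form with a missing or unreadable field from shifting the columns of the combined data set.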

With Designer, Peter can complete his work much faster, which means more time for shenanigans with the Lost Boys!


Want to try the Alteryx Intelligence Suite for yourself?

Click here to download a free trial of the Alteryx Intelligence Suite.

Get the workflow in this blog here.

tdukes_bcbs · ‎06-23-2022

Very well laid out and detailed. Thank you for uploading this. I do notice some garbage coming in when I run this.

Record 41 Ordinary business income (loss) value per pdf: 43496; per Alteryx workflow: "/43496"

There are other lines with trash in the data.

Could this be due to DPI settings in the PDF?

Thank you again for this post.

Thomas Dukes


Extracting Text from Tax Forms Making You Grow Old? Come to NeverCopyPasteLand