
Maveryx Success Stories


Using the Text Mining Tools on Federal Tax Forms

jacob_kahn
12 - Quasar

Overview of Use Case

There has been a lot of excitement regarding the new Intelligence Suite released by Alteryx. I am very fortunate and thankful to my sales engineer, @MikeN, for helping me get the new Text Mining tool palette installed on my computer. Since then, I can only say I’ve been addicted to building out use cases and a business case to buy the complete suite in September 2020.

 

One of the biggest challenges I’ve faced as an Alteryx Artisan and user in my organization is telling teams throughout the organization that I cannot help them with their problems using Alteryx if most of their data is in PDF format. There were times when I’d suggest converting PDF documents to Excel or using OCR technologies, but these solutions are either inefficient, inconsistent, or very expensive. The new Intelligence Suite Text Mining tool palette has changed that for me going forward.


 
Describe the business challenge or problem you needed to solve
 
I’m an accountant. I deal with PDF forms all the time. I send clients PDF forms, I file certain states with PDF forms, and I receive PDF forms from almost everyone.
 
 

In the example that sparked this use case, we were onboarding a new client that only had their prior year forms in PDF format. If you are a tax accountant and familiar with Thomson Reuters products and XML filings, you know there are certain ways of moving data within and between systems. However, none of those were available in this case.

 

I was tasked with manually entering prior year information from the client's PDF federal tax returns into Excel workbooks, since much of the information is carried forward on current year federal tax forms.

I was like uhhhh…manually? With my fingers? In Excel? Like Adobe? What?



 
Describe your working solution
 
All I have to say now is Alteryx – Text Mining Tool Palettes.

Firstly, I used the Image Template tool to map out the annotations (or fields) in my PDF form that I wanted to extract information from.

 
 
 

Image 1.png


Then, I navigated to a PDF form with the PDF Input tool, which is also found in the new Text Mining tool palette.

 

I performed some simple data manipulation to make sure that the pages in my PDF document matched the correct PDF template built in an Image Template tool, and then ran the result through the new Image to Text tool. The configuration is very simple if you haven’t used it before.

 

Image 2.png

 

 

With further manipulation and Transform tools, I was able to transpose all of the data extracted from the PDF form, keyed on the file path and page.

 

Image 3.png
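
For readers who think in code, here is a minimal Python sketch of what the transpose step does conceptually. It is not the Alteryx workflow itself. The column names ("path", "page", "Line 01", "Line 02a") and the sample values are hypothetical, and the sketch assumes the extraction step returns one column per annotation.

```python
def transpose(rows, key_fields):
    """Turn every non-key column into its own (Name, Value) record,
    repeating the key columns on each output record."""
    out = []
    for row in rows:
        keys = {k: row[k] for k in key_fields}
        for name, value in row.items():
            if name not in key_fields:
                out.append({**keys, "Name": name, "Value": value})
    return out


# Hypothetical extraction result: one row per page, one column per annotation.
wide = [
    {
        "path": "client_return.pdf",
        "page": 1,
        "Line 01": "Acme Corp",
        "Line 02a": "100 Main St\nSuite 4\nSpringfield",
    },
]

long_rows = transpose(wide, key_fields=["path", "page"])
for record in long_rows:
    print(record)
```

Each annotation ends up on its own row, still tagged with the file path and page it came from, which makes the later cleansing steps easier to apply per field.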

 

 

I used a batch macro grouped by the actual fields (i.e. annotations) to cleanse all of the data. This lets me append a RecordID to each grouped set in instances where a tabbed line contains information on multiple lines.

 

You can see how clean the data looks after it runs through the batch macro.

 

Image 5.png

 

Also note the point I just referenced: for Line 02a, you actually get three records, each with a different RecordID number. This way I know when there is a line 1, a line 2, and a line 3. Oftentimes, this is a name and address.
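
The per-field RecordID idea can be sketched in Python as follows. This is only an illustration of the logic, not the batch macro itself. The field names and values are made up, and the sketch assumes multi-line fields arrive as a single string with newline separators.

```python
def split_lines_with_record_id(records):
    """For each field, split its value into lines and number them from 1.

    The counter restarts for every field, which mirrors processing each
    field as its own group.
    """
    out = []
    for rec in records:
        lines = [ln.strip() for ln in rec["Value"].splitlines() if ln.strip()]
        for record_id, line in enumerate(lines, start=1):
            out.append({"Name": rec["Name"], "RecordID": record_id, "Value": line})
    return out


# Hypothetical extracted fields; "Line 02a" holds a three-line address.
extracted = [
    {"Name": "Line 01", "Value": "  Acme Corp "},
    {"Name": "Line 02a", "Value": "100 Main St\nSuite 4\nSpringfield"},
]

cleansed = split_lines_with_record_id(extracted)
for record in cleansed:
    print(record)
```

With this sample input, "Line 02a" produces three records numbered 1 to 3, while "Line 01" produces a single record numbered 1.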

 

Describe the benefits you have achieved
 

This use case and the tool we built are going to save us time extracting data from prior year federal tax returns that we only have as shared PDF files. We also eliminate human error in data entry.

 

We now simply map the Excel output from the workflow to other third-party applications, Excel workbooks, and Alteryx workflows to continue the process quickly and efficiently.

 

There is so much more to discuss with regard to confirming that you received all of the information you wanted. I’ve simply built Text Input tools that contain all of the data points I want to extract, and I use a Find Replace tool to append the extracted data to my template. Then I can review what was extracted and what was empty on the form.
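
The review step above amounts to a lookup of extracted values against a list of expected fields. Here is a minimal Python sketch of that idea. The field names, values, and status labels are hypothetical, and this is not how the Find Replace tool is implemented.

```python
def review(expected_fields, extracted):
    """Attach the extracted value to every expected field and flag the gaps."""
    report = []
    for name in expected_fields:
        value = extracted.get(name, "").strip()
        status = "extracted" if value else "empty"
        report.append({"Name": name, "Value": value, "Status": status})
    return report


# Hypothetical template of data points I expect to find on the form.
expected_fields = ["Line 01", "Line 02a", "Line 03"]

# Hypothetical extraction result; "Line 03" was blank on the form.
extracted = {"Line 01": "Acme Corp", "Line 02a": "100 Main St"}

report = review(expected_fields, extracted)
for row in report:
    print(row)
```

Starting from the template rather than from the extracted data is what makes missing fields visible: every expected data point appears in the report, whether or not anything was found for it.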

 



Why this over OCR?

Well, simply put, OCR is expensive. It takes a lot of time to map documents to their proper data points, and it requires users to confirm the data extraction in order to assist with the machine learning side of the technology. With the annotations feature in the Image Template tool, I don’t have to confirm data extraction, because I know that the tool will always refer to the exact same location.

 

Where is there to grow?

There are a lot of things I am still unsure of with the new Text Mining tool palette. For example, sometimes when I map out an entire PDF form, the tool bugs out and I lose the annotations that I spent a lot of time creating. There are also instances where the PDF shared with me was printed in a different format or size, which causes errors in my data extraction. If that scenario comes up often, OCR is more consistent and the better investment. However, for the current investment we make in Alteryx, and with the tools that we have at hand, this has become an amazing feature and addition to my data skills library.
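
To illustrate why a different page size breaks fixed-location extraction, here is a small Python sketch. It is not a feature of the Image Template tool. The Box type, the rescale helper, and the coordinates are all hypothetical, and the page sizes are in points (US Letter is 612 x 792, and A4 is approximately 595 x 842).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A rectangular annotation, in the same units as the page size."""
    left: float
    top: float
    right: float
    bottom: float


def rescale(box, template_size, page_size):
    """Stretch a template annotation to a page with different dimensions.

    Assumption: the form content was scaled along with the page. If the
    form was instead printed at its original size with different margins,
    an offset would be needed rather than a scale factor.
    """
    sx = page_size[0] / template_size[0]
    sy = page_size[1] / template_size[1]
    return Box(box.left * sx, box.top * sy, box.right * sx, box.bottom * sy)


# Hypothetical annotation drawn on a US Letter template.
line_02a = Box(left=72, top=150, right=300, bottom=200)

# The same annotation adjusted for a page that arrived in roughly A4 size.
adjusted = rescale(line_02a, template_size=(612, 792), page_size=(595, 842))
print(line_02a)
print(adjusted)
```

The original box and the adjusted box cover different coordinates, so reading the unadjusted location on the resized page would pick up the wrong region of the form.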

 

If you have any questions on anything discussed in this use case, please feel free to reach out here, via LinkedIn, or through my Instagram (if you hover over my Profile icon).

 

I am so happy to share this with the Alteryx community and I can’t wait to see what others build with the text mining tools.

 

Happy Pride – Happy Summer – Happy Health – Happy Unity.

 


 

J


Comments
JunePark
8 - Asteroid

I should try this out as well, thanks.

ChrisK1
7 - Meteor

Thank you for sharing and this is an awesome process! This may be a silly question, but where did you find the text mining tool palette? I was doing some searching and I'm not able to find them. Are these tools that you have developed yourself?

rohit782192
11 - Bolide
Hi,

I am also not able to see the tool palette.
jacob_kahn
12 - Quasar

Hey everyone! I felt the same way when I heard about the Text Mining Tools! I'd recommend that you reach out to your Sales Engineer! I reached out and they were extremely helpful in installing the tool palette. 

Please remember that you have to have the R-Package installed on your computer; and that should be running as Admin 😉

 

Thanks 🙂

 

J

ThalitaC
Alteryx Alumni (Retired)

@ChrisK1 @rohit782192 you can also find more information about how to install the Text Mining tool categories in here:

https://help.alteryx.com/current/designer/alteryx-intelligence-suite

and like Jacob said you can always contact your Sales Engineer and they will be happy to help you. 

rohit782192
11 - Bolide
Hi,

There is no trial version available for these. I have a later version of Alteryx, 20.2.
MikeN
Alteryx Alumni (Retired)

@rohit782192 - there is a trial for this : reach out to your Alteryx rep.

KylePeterson
7 - Meteor

This is really cool @jacob_kahn !  I work in Indirect Tax so I can truly see the benefits of this on our hundreds of returns each month ;).  

 

Thanks for sharing this!

PhilippK
Alteryx Alumni (Retired)

Good content @jacob_kahn !
Can you share the workflow with us?

Thank you!

trettelap
8 - Asteroid

Awesome article! Also interested in the workflow...

rohit782192
11 - Bolide
If possible, please share the workflows.

Thanks and Regards
Rohit Gupta.
pvara
8 - Asteroid

Can you please share your workflow? I am trying to build my first PDF extraction workflow and I am not able to follow your example.

 

Thank you

sriniprad08
11 - Bolide

Hi 

Great stuff. Can you please share the workflow? Thanks

cpearse
6 - Meteoroid

Can you provide an example of this workflow please?