Showcase your achievements in the Maveryx Community by submitting a Success Story now!
SUBMISSION INSTRUCTIONSThere has been a lot of excitement regarding the new Intelligence Suite released by Alteryx. I am very fortunate and thankful to my sales engineer - @MikeN helping me get the new Text Mining tool palette installed on my computer. Since, I can only say I’ve been addicted to building out use cases and a business case to buy the complete suite out in September 2020.
One of the biggest challenges I’ve faced as an Alteryx Artisan and user in my organization is telling teams throughout the organization that I cannot help them with their problems using Alteryx if most of their data is in PDF format. There were times where I’d suggest converting PDF documents to Excel, or utilize OCR technologies, but these solutions are either inefficient, inconsistent, or very expensive. The new Intelligence Suite Text Mining tool palette has changed that for me going forward.
In the example that spiked this use case, we were onboarding a new client that only had their prior year forms in PDF format. If you are a tax accountant and familiar with Thomson Reuters products and XML filings, you know there are certain ways of moving data within and between systems, however – this was not the case.
I was tasked with manually entering prior year information from the clients PDF federal tax returns into Excel workbooks – much of the information is carried forward on current year federal tax forms.
I was like uhhhh…manually? With my fingers? In Excel? Like Adobe? What?
Firstly, I used the Image Template tool to map out the annotations (or fields) in my PDF form that I wanted to extract information from.
Then, I navigate to a PDF form with the PDF Input tool also found in the new Text Mining tool palette.
I perform some simple data manipulation to make sure that the pages in my PDF document match with the correct PDF template built into an Image Template tool, and simply run it through the new Image to Text tool. The configuration is very simple if you haven’t used it before.
With further manipulation and Transform tools, I am able to transpose all of the data on the path and page of the data extracted from the PDF form.
I used a batch macro grouped by the actual fields (i.e. annotations) to cleanse all of the data. This is so that I can append a RecordID to each grouped set in instances where there may be a tabbed line of information containing information on multiple lines.
You can see how clean the data looks after it runs through the batch macro.
Also note the point I just referenced where for Line 02a – you actually get three records each with a different RecordID number. This way I know when there is a line 1, a line 2 and a line 3. Often times, this is a name and address.
This use case and tool we built is going to save us time extracting data from prior year federal tax returns that we only have via PDF file sharing. We also eliminate human error in data entry.
We simply now mapped the Excel output that comes out of workflow to other third party applications, Excel workbooks and Alteryx workflows to continue the process efficiently and quick.
There is so much more to discuss with regards to confirming all of the information that you wanted to receive. I’ve simply built Text Input tools that contain all of the data points I want to extract, and simply use a Find Replace tool to append the extracted data to my template – then I can simply just review what was extracted and what was empty on the form.
Why this over OCR?
Well simply OCR is expensive. It takes a lot of time to map documents to their proper data points, and it requires users to confirm the data extraction in order to assist with the machine learning side of the technology. With the annotations feature in the Image Template tool, I don’t have to confirm data extraction – I know that the tool will always refer to the exact same location.
Where is there to grow?
There are a lot of things I am still unsure of with the new Text Mining tool palette. For example, sometimes when I map out an entire PDF form, the tool actually bugs out and I lose my annotations which I spend a lot of time creating. As well, there are instances where the actual PDF shared with me is printed in a different format or size which also causes error in my data extraction. This is something that proves OCR more consistent and a better investment if this scenario will present itself in many cases. However, for the current investment we make in Alteryx, and the tools that we have at hand – this has become an amazing feature and addition to my data skills library.
If you have any questions on anything discussed in this use case, please feel free to reach out here, Via LinkedIn, or through my Instagram (if you hover over my Profile icon).
I am so happy to share this with the Alteryx community and I can’t wait to see what others build with the text mining tools.
Happy Pride – Happy Summer – Happy Health – Happy Unity.
J
I should try this out as well, thanks.
Thank you for sharing and this is an awesome process! This may be a silly question, but where did you find the text mining tool palette? I was doing some searching and I'm not able to find them. Are these tools that you have developed yourself?
Hey everyone! I felt the same way when I heard about the Text Mining Tools! I'd recommend that you reach out to your Sales Engineer! I reached out and they were extremely helpful in installing the tool palette.
Please remember that you have to have the R-Package installed on your computer; and that should be running as Admin 😉
Thanks 🙂
J
@ChrisK1 @rohit782192 you can also find more information about how to install the Text Mining tool categories in here:
https://help.alteryx.com/current/designer/alteryx-intelligence-suite
and like Jacob said you can always contact your Sales Engineer and they will be happy to help you.
@rohit782192 - there is a trial for this : reach out to your Alteryx rep.
This is really cool @jacob_kahn ! I work in Indirect Tax so I can truly see the benefits of this on our hundreds of returns each month ;).
Thanks for sharing this!
Good content @jacob_kahn !
Can you share the workflow with us?
Thank you!
Awesome article! Also interested in the workflow...
can you please share your workflow? I am trying to build my first pdf extraction WF and I am not following your example.
Thank you
Hi
Great stuff. Can you please share the work flow? Thanks
Can you provide an example of this workflow please?