Upgrading Server to version 22.1? We’ve enabled the faster AMP engine by default. Read about how these system settings changes might affect you.

We’re aware of an intermittent issue with our My Alteryx login and are actively working to have the issue corrected. If you run into an error when logging in, please try clearing all cookies or accessing the community on a different browser. Thank you for your patience!

Data Science

Machine learning & data science for beginners and experts alike.
georgewillv
Alteryx
Alteryx

Imagine you’re an auditor: you’ve been tasked with validating information on a company’s SEC Form 10Q. This seems like an easy task, but there’s a twist: you’ve only been sent a PDF version of this document. You slump your head. What could’ve taken a day at most will now likely take a few days, or weeks, as you don’t know how to extract the information other than copying and pasting it out.

 

giphy.gif

Source

 

Luckily, you see the Alteryx Designer icon in your taskbar! You lift your head. Sparkles form in your eyes as you remember reading an article on the Computer Vision tools and how they can extract tabular information from PDFs. You open Designer and start building out your workflow.

 

Reading in the Data

 

Select the Input Image tool. Within it select the folder that contains the PDF you want to analyze and name the blob you want to pass downstream. In the Image Template tool, select the blob you just named. Connect the Image Input Tool to the D anchor of the Image to Text tool, and the Image Template Tool to the T anchor of the Image to Text tool. By establishing these connections, Designer can do automatic table detection.

 

screenshot1.png

 

This configuration allows Designer to identify text within the PDF tables and output the results as usable tables for analysis. Some data prep and cleaning must be done along the way; however, this approach is a huge time saver compared to the alternative of copying, pasting, and manually organizing each field within Excel.

 

Data Prep

 

Now that we have read in the data, the Image to Text tool will output “table” columns when it detects a table on a given page of the PDF.

 

screenshot2.png

Note: this workflow is only going to cleanse and prep the table on page 3 as an example. This section represents the “assets” portion of the Balance Sheet. Results will vary given the type of table and the use case.

 

Here we can see that there is a table on page 3 and that Designer believes it has 3 columns. After filtering out the null values (which signify that the computer vision tools did not detect a table), let’s use the Text to Column tool to parse the data and see what we’re dealing with.

 

screenshot3.png

This container is cached; we highly recommend caching this part of the workflow to avoid waiting on the data to be read in every time you make a change manipulating it. Click the output anchor of the Image to Text tool and click “cache and run workflow.”

 

After splitting the columns into rows on the “\n” (new line) delimiter, we can see there is information separated by pipes when we click on the individual cells in the “table0” column.

 

screenshot4.png

After splitting to rows with “\n” delimiter

 

We then apply a Filter to grab the data from page 3, the assets section of the balance sheet. We will now be using the Text to Columns tool again, this time splitting the data into columns using the pipe (“|”) as a delimiter.

 

screenshot5.png

After splitting to columns on | delimiter

 

By using the Text to Columns Tool again, we see that there are two dates with their own respective information – September 30th, 2006 (Table 02) and December 31st, 2005 (Table 03).

 

*We’re going to skip over the gritty details of the cleanup process used. If you’re interested in how the data was prepped post extraction from the Image to Text tool, please skip to the bottom to see how you can run the workflow yourself.

 

Analysis and Results

 

Now that the data is prepped and cleansed, let’s being our analysis! In the bottom half of our workflow, we’re going to pull in an earlier 10Q from this company to ensure data quality. Since Table 03 was found in both PDF files we have used, we want to make sure that the unstructured text line items on the balance sheets match for the December columns. Since this is for the same financial year, the values should be identical. If the values aren’t correct, then we’re going to have to dig deeper into the numbers and see what’s going on.

 

This is also a good way to ensure that formatting is correct across the different time periods to emphasize maximum data usability in the event time periods vary. Additionally, we’re going to find the quarter over quarter change for the line items on the March 10Q form. This is done to augment an auditor’s horizontal analysis on the company’s balance sheet. If there are any substantial deltas, the auditor would need to dive deeper to uncover the reason for period over period movements.

 

Now that this PDF data is clean, we want to compare the table03 column from a past 10Q to ensure that they match.

 

screenshot6.png

After processing and cleaning March’s 10Q form

 

Since we’ve done a lot of the leg work during the cleaning, reaching the end goal is a relatively straightforward task. We run a Join on the union-ed streams. Then run a Formula tool to create a “Fields Match” column specifying if the fields match or not if they’re equal. Finally, we’ll filter on that “Fields Match” column to see if any of the fields were not equal.

 

screenshot7.png

Data Check to see if anything is awry

 

And if we look at the False output:

 

screenshot8.png

 

We can see that there is nothing that was not equal. This indicates that everything is in order with this portion of the 10Q.

 

Finally, we’re going to get the quarter over quarter (QOQ) change for the line items on the March balance sheet. This is an important step for auditors performing horizontal analysis as they need to apply this step to every item on every financial document the company has for the year. This enables a “big picture” perspective in terms of growth for the year. To get the QOQ change, we simply apply a Formula tool with the following expression:

 

screenshot9.png

 

And now our output looks like the following:


screenshot10.png

 QOQ output

 

And there you have it. You’ve successfully compared two separate 10Qs from this company and validated that things are as expected. Additionally, you’ve uncovered the QOQ change for this balance sheet for the March SEC 10Q. You apply the steps you’ve learned in this article to the rest of the document and finish your analysis before the end of the workday. You send the report to your manager, who is stunned you’ve pulled off such analysis so quickly! He informs you that you’re going to be getting a bonus this quarter and are on the right path for a promotion! You head home after a job well done, all thanks to the analytics process automation power of Alteryx.

 

Try It Yourself

 

If you would like to try out this workflow tutorial:

  1. Download a trial version of Alteryx Intelligence Suite if you don’t already have it here: Alteryx Intelligence Suite Free Trial
  2. Download the Form 10Q Processing zip file
  3. Unzip the Folder.
  4. Open Form10Q_Processing.yxmd in Designer
  5. Specify the Directory tools to the NTST_10Q_Sept06 and NTST_10Q_Mar06 in the top and bottom half of the workflow, respectively.
  6. Run the workflow! Please be aware that the run will take some time (3-5 minutes) depending on your system resources

Additional resources on the AIS tools: