Data Science

georgewillv · ‎03-10-2022

Imagine you’re an auditor: you’ve been tasked with validating information on a company’s SEC Form 10Q. This seems like an easy task, but there’s a twist: you’ve only been sent a PDF version of this document. You slump your head. What could’ve taken a day at most will now likely take a few days, or weeks, as you don’t know how to extract the information other than copying and pasting it out.

Source

Luckily, you see the Alteryx Designer icon in your taskbar! You lift your head. Sparkles form in your eyes as you remember reading an article on the Computer Vision tools and how they can extract tabular information from PDFs. You open Designer and start building out your workflow.

Reading in the Data

Select the Input Image tool. Within it select the folder that contains the PDF you want to analyze and name the blob you want to pass downstream. In the Image Template tool, select the blob you just named. Connect the Image Input Tool to the D anchor of the Image to Text tool, and the Image Template Tool to the T anchor of the Image to Text tool. By establishing these connections, Designer can do automatic table detection.

This configuration allows Designer to identify text within the PDF tables and output the results as usable tables for analysis. Some data prep and cleaning must be done along the way; however, this approach is a huge time saver compared to the alternative of copying, pasting, and manually organizing each field within Excel.

Data Prep

Now that we have read in the data, the Image to Text tool will output “table” columns when it detects a table on a given page of the PDF.

Note: this workflow is only going to cleanse and prep the table on page 3 as an example. This section represents the “assets” portion of the Balance Sheet. Results will vary given the type of table and the use case.

Here we can see that there is a table on page 3 and that Designer believes it has 3 columns. After filtering out the null values (which signify that the computer vision tools did not detect a table), let’s use the Text to Column tool to parse the data and see what we’re dealing with.

This container is cached; we highly recommend caching this part of the workflow to avoid waiting on the data to be read in every time you make a change manipulating it. Click the output anchor of the Image to Text tool and click “cache and run workflow.”

After splitting the columns into rows on the “\n” (new line) delimiter, we can see there is information separated by pipes when we click on the individual cells in the “table0” column.

After splitting to rows with “\n” delimiter

We then apply a Filter to grab the data from page 3, the assets section of the balance sheet. We will now be using the Text to Columns tool again, this time splitting the data into columns using the pipe (“|”) as a delimiter.

After splitting to columns on | delimiter

By using the Text to Columns Tool again, we see that there are two dates with their own respective information – September 30th, 2006 (Table 02) and December 31st, 2005 (Table 03).

*We’re going to skip over the gritty details of the cleanup process used. If you’re interested in how the data was prepped post extraction from the Image to Text tool, please skip to the bottom to see how you can run the workflow yourself.

Analysis and Results

Now that the data is prepped and cleansed, let’s being our analysis! In the bottom half of our workflow, we’re going to pull in an earlier 10Q from this company to ensure data quality. Since Table 03 was found in both PDF files we have used, we want to make sure that the unstructured text line items on the balance sheets match for the December columns. Since this is for the same financial year, the values should be identical. If the values aren’t correct, then we’re going to have to dig deeper into the numbers and see what’s going on.

This is also a good way to ensure that formatting is correct across the different time periods to emphasize maximum data usability in the event time periods vary. Additionally, we’re going to find the quarter over quarter change for the line items on the March 10Q form. This is done to augment an auditor’s horizontal analysis on the company’s balance sheet. If there are any substantial deltas, the auditor would need to dive deeper to uncover the reason for period over period movements.

Now that this PDF data is clean, we want to compare the table03 column from a past 10Q to ensure that they match.

After processing and cleaning March’s 10Q form

Since we’ve done a lot of the leg work during the cleaning, reaching the end goal is a relatively straightforward task. We run a Join on the union-ed streams. Then run a Formula tool to create a “Fields Match” column specifying if the fields match or not if they’re equal. Finally, we’ll filter on that “Fields Match” column to see if any of the fields were not equal.

Data Check to see if anything is awry

And if we look at the False output:

We can see that there is nothing that was not equal. This indicates that everything is in order with this portion of the 10Q.

Finally, we’re going to get the quarter over quarter (QOQ) change for the line items on the March balance sheet. This is an important step for auditors performing horizontal analysis as they need to apply this step to every item on every financial document the company has for the year. This enables a “big picture” perspective in terms of growth for the year. To get the QOQ change, we simply apply a Formula tool with the following expression:

And now our output looks like the following:

QOQ output

And there you have it. You’ve successfully compared two separate 10Qs from this company and validated that things are as expected. Additionally, you’ve uncovered the QOQ change for this balance sheet for the March SEC 10Q. You apply the steps you’ve learned in this article to the rest of the document and finish your analysis before the end of the workday. You send the report to your manager, who is stunned you’ve pulled off such analysis so quickly! He informs you that you’re going to be getting a bonus this quarter and are on the right path for a promotion! You head home after a job well done, all thanks to the analytics process automation power of Alteryx.

Try It Yourself

If you would like to try out this workflow tutorial:

Download a trial version of Alteryx Intelligence Suite if you don’t already have it here: Alteryx Intelligence Suite Free Trial
Download the Form 10Q Processing zip file
Unzip the Folder.
Open Form10Q_Processing.yxmd in Designer
Specify the Directory tools to the NTST_10Q_Sept06 and NTST_10Q_Mar06 in the top and bottom half of the workflow, respectively.
Run the workflow! Please be aware that the run will take some time (3-5 minutes) depending on your system resources

Additional resources on the AIS tools:

Data Science

Using Alteryx to Extract SEC Form 10Q Data

Reading in the Data

Data Prep

Analysis and Results

Try It Yourself

Extracting Data Help!!

Form data question

Extract information using Alteryx.

Data Extraction Reformatting from Excel

Extracting Data from String