Data Science

KnowledgeSpark · 01-05-2023

In Compare PDF Files Using Computer Vision: Uncover Existence of Handwritten Notes, Part 1, Steps 1 through 4 flagged two of the six pages of a signed contract as having handwritten markups when compared to the original contract.

Part 2 takes that list and makes it pretty! More specifically, I combined Computer Vision with Reporting tools to display the original and marked up pages side-by-side in a new PDF file.

Since the list of pages with markups comes from a mathematical equation that tests the ratio of bright pixel counts between the original and new documents, my idea was to view those markup pages as *potential* markup pages until a human could review them. That’s why the end product is a collection of side-by-side pages.
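For readers who think in code, here is a minimal Python sketch of the general idea behind a bright-pixel ratio test. It is not the formula from Part 1: the function names, the brightness threshold of 200, and the 1% tolerance are all hypothetical values chosen for illustration, and the "pages" are tiny grids of grayscale numbers rather than real scans.

```python
def bright_pixel_count(pixels, threshold=200):
    """Count grayscale values at or above the (hypothetical) brightness threshold."""
    return sum(1 for row in pixels for value in row if value >= threshold)


def is_potential_markup(original, revised, tolerance=0.01, threshold=200):
    """Flag a page when the bright-pixel ratio drifts from 1.0 by more than tolerance."""
    original_count = bright_pixel_count(original, threshold)
    revised_count = bright_pixel_count(revised, threshold)
    if original_count == 0:
        # Avoid dividing by zero: any bright pixel in the revision is a difference.
        return revised_count != 0
    ratio = revised_count / original_count
    return abs(ratio - 1.0) > tolerance


# Tiny 3x4 "pages" of grayscale values (0 = black ink, 255 = white paper).
original_page = [
    [255, 255, 255, 255],
    [255, 0, 0, 255],
    [255, 255, 255, 255],
]
marked_up_page = [
    [255, 255, 0, 255],
    [255, 0, 0, 255],
    [255, 0, 255, 255],
]

print(is_potential_markup(original_page, original_page))   # False
print(is_potential_markup(original_page, marked_up_page))  # True
```

A test like this only says that the amount of ink changed, not what changed, which is why the flagged pages are best treated as candidates for human review.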

Step 5 – Build Report

This collection of tools is all about building a report that can be sent to a PDF file.

The first part of Step 5 focuses on processing the images from the original and signed contracts in parallel. Those get joined together, sorted by page number, and then processed as a header and a table, which in turn get stacked together. Let’s break it down.

Image Processing – Shrinking to 25% of the Original Size

Remember that our final output displays each full marked-up page of the unsigned and signed contracts side-by-side. We accomplish that by shrinking the pages to a size that is small enough to fit on the screen (or a letter-sized piece of paper) yet large enough to make visual inspection easy.

I mentally numbered the next three tools dealing with File 1 as numbers 1, 2, and 3. Here’s a screenshot to show you what I mean.

Tool 1 in this set is Image Processing. The data before it runs looks like this. I will be working with the image in the field [File 1 Original Image].

With the Image Processing tool (Tool 1) selected, I’ll choose the field I care about from the dropdown list. Then I will add the Scale step and set it to 25 as the percentage of the Width. With the checkbox checked for Lock Aspect Ratio, the Height field will also be set to 25% of the original.
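The arithmetic behind a locked aspect ratio can be sketched in a few lines of Python. This is only an illustration of the math, not the tool's internals: the function name, the rounding behavior, and the 1700 x 2200 page size (a letter-sized page scanned at a hypothetical 200 DPI) are my assumptions.

```python
def scaled_size(width, height, width_percent, lock_aspect_ratio=True, height_percent=100):
    """Return the (width, height) in pixels after scaling by the given percentages."""
    if lock_aspect_ratio:
        # With the ratio locked, the height follows the width percentage.
        height_percent = width_percent
    new_width = max(1, round(width * width_percent / 100))
    new_height = max(1, round(height * height_percent / 100))
    return new_width, new_height


# Hypothetical page image: 1700 x 2200 pixels.
print(scaled_size(1700, 2200, 25))  # (425, 550)
```

Because both dimensions shrink to 25%, the scaled image holds one sixteenth of the original pixels, which is what lets two pages sit comfortably side-by-side.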

Once this tool runs, it creates a new field named [File 1 Original Image_processed].

Next, Tool 2 uses the Image tool from the Reporting tab.

To configure this tool, I select the radio button next to “Get Image From Binary Data In Field” and then choose the new [File 1 Original Image_processed] field.

At this point, I’ve got the original image shrunk to 25%, stored as a new field, and that new field has been processed as a binary image. The Image tool takes the field name of whatever it was given and names it [Image]. It’s like that one great aunt who calls every male person “Carl,” regardless of that person’s actual name. (You don’t have an aunt like that? Lucky you.)

Tool 3 is just a Select tool that drops most of the fields and renames the [Image] field (or “Carl,” as we affectionately call it) to something more descriptive, like “File 1 Image.”

That’s one set of three tools, all of which have been focused on File 1.

Then there’s File 2. That gets its own set of three tools, which do all the same things, only pointed at [File 2 Original Image] instead of [File 1 Original Image].

Then I join the two streams of processed images back together on the [Test for Markups] field.

To be tidy, I drop the [Test for Markups] and [Page] fields from the right input.

Now I add a Sort tool and set it to sort by the [Page] field in ascending order. In this example, that makes sure that page 3 appears before page 4.
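Here is a rough Python sketch of the join, tidy-up, and sort steps, using plain lists of dictionaries as stand-ins for the two data streams. The function name, the placeholder image strings, and the "Potential markup" flag value are hypothetical. One assumption to call out loudly: the sketch keys on both [Page] and [Test for Markups] so that each page pairs only with its own counterpart, whereas the text above names only [Test for Markups].

```python
def join_pages(left, right, keys=("Page", "Test for Markups")):
    """Inner-join two lists of page records, keeping the key fields from the left side only."""
    # Assumes each key combination appears at most once in the right stream.
    right_by_key = {tuple(rec[k] for k in keys): rec for rec in right}
    joined = []
    for rec in left:
        match = right_by_key.get(tuple(rec[k] for k in keys))
        if match is None:
            continue  # unmatched pages are dropped, as in an inner join
        # Drop the right-hand copies of the key fields to stay tidy.
        right_extras = {name: value for name, value in match.items() if name not in keys}
        joined.append({**rec, **right_extras})
    # Sort ascending by page so page 3 appears before page 4.
    return sorted(joined, key=lambda rec: rec["Page"])


file_1_pages = [
    {"Page": 4, "Test for Markups": "Potential markup", "File 1 Image": "<original page 4>"},
    {"Page": 3, "Test for Markups": "Potential markup", "File 1 Image": "<original page 3>"},
]
file_2_pages = [
    {"Page": 3, "Test for Markups": "Potential markup", "File 2 Image": "<signed page 3>"},
    {"Page": 4, "Test for Markups": "Potential markup", "File 2 Image": "<signed page 4>"},
]

report_rows = join_pages(file_1_pages, file_2_pages)
print([row["Page"] for row in report_rows])  # [3, 4]
```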

Moving along, I create a header for the report.

Now I add a Select tool and then drop every field except for [File 1] and [File 2], which were just added in the steps immediately prior.

The data will include one record for every page.

In this example, “every page” means just two records. I only need one record, so I use the Select Records tool to keep only the first row.

The final step for creating a header that automatically updates as the underlying data updates is to pull in the Report Text tool from the Reporting tab. I selected the radio button next to “Create new field for this text” and entered “Report Header” as the Field Name value. Then I typed the text for the header in the section below, with brackets wrapping the field names. Notice that you can adjust font sizes and add basic formatting like bold, italics, and underlining. Notice, too, that a field name is wrapped in brackets, accompanied by a colon and a capital “A,” and that the brackets are enclosed within quotes.

The text here gets rendered like this when the report runs.

These three tools created that. Useful, right?
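The logic of those three tools (keep two fields, keep the first record, fill in a text template) can be sketched in Python like this. Everything specific here is hypothetical: the function names, the file names, and the header wording are made up for illustration, and the bare [Field] placeholders are a simplification rather than the Report Text tool's actual syntax. I am also assuming that [File 1] and [File 2] hold file names.

```python
import re


def render_header(template, record):
    """Replace each [Field Name] placeholder with that field's value from the record."""
    def lookup(match):
        return str(record[match.group(1)])
    return re.sub(r"\[([^\]]+)\]", lookup, template)


def build_header(records, template, fields=("File 1", "File 2")):
    """Keep only the header fields, take the first record, and render the template."""
    if not records:
        return None
    first = {name: records[0][name] for name in fields}
    return render_header(template, first)


# One record per flagged page; the file names repeat on every record.
records = [
    {"Page": 3, "File 1": "original_contract.pdf", "File 2": "signed_contract.pdf"},
    {"Page": 4, "File 1": "original_contract.pdf", "File 2": "signed_contract.pdf"},
]

header = build_header(records, "Original: [File 1] | Signed: [File 2]")
print(header)  # Original: original_contract.pdf | Signed: signed_contract.pdf
```

Since the header text is built from the data, pointing the workflow at a different pair of files updates the header without any retyping.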

Now I add the detail that will appear below the header.

This part simply makes use of the Table tool from the Reporting tab.

The top part of the Configuration is set to use the Basic table mode and to group by Page.

You can configure your reporting settings however you like, of course. However, if you want to duplicate my results, here is what I did with the column configurations.

The column configurations in the bottom part rename the [Page] field to “Page #” and adjust its formatting.

The [Page] field:

- Rename Field: renamed Page to Page #
- Alignment: changed from Right to Center
- Column Rules: set up a rule (click the Create button to create it; it then becomes the Edit button)
  - Font Size: changed from the default of 8 to 10

The [Test for Markups] field was de-selected.

That’s it! All the other fields were left with default settings.

Next up, the report’s header and table were combined.

I used the Union tool with the header as the first connection and the table as the second connection. Normally, I can leave a Union tool with its default configuration of Auto Config by Name, but this time it was important to change that to Manually Configure Fields. I arranged things so that [Table] from stream #2 fell underneath [Report Header] from stream #1. [File 1] and [File 2] stood alone, and so did the [Page] field.
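To see why stacking by name is not enough here, consider this small Python sketch that contrasts a union by field name with a manually aligned one. The function names and placeholder values are hypothetical, and naming the combined column after the first stream's field ([Report Header]) is my assumption about how the stacked output ends up labeled.

```python
def union_by_name(first, second):
    """Stack two record lists, matching columns by name and filling gaps with None."""
    columns = []
    for rec in first + second:
        for name in rec:
            if name not in columns:
                columns.append(name)
    return [{name: rec.get(name) for name in columns} for rec in first + second]


def union_manual(first, second, aligned):
    """Stack two record lists after placing second-stream columns under first-stream ones.

    aligned maps a column name in the second stream to the first-stream column
    it should sit underneath.
    """
    renamed = [{aligned.get(name, name): value for name, value in rec.items()} for rec in second]
    return union_by_name(first, renamed)


header_stream = [
    {"File 1": "original_contract.pdf", "File 2": "signed_contract.pdf", "Report Header": "<header>"},
]
table_stream = [
    {"Page": 3, "Table": "<table for page 3>"},
    {"Page": 4, "Table": "<table for page 4>"},
]

by_name = union_by_name(header_stream, table_stream)
manual = union_manual(header_stream, table_stream, {"Table": "Report Header"})

print(sorted(by_name[0]))  # ['File 1', 'File 2', 'Page', 'Report Header', 'Table']
print(sorted(manual[0]))   # ['File 1', 'File 2', 'Page', 'Report Header']
print(manual[1]["Report Header"])  # <table for page 3>
```

Matching by name leaves the header and the tables in two separate columns, each half empty. Aligning them manually produces a single column that a later step can lay out from top to bottom.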

Next (and second to last), the Layout tool followed the Union tool. There was no need to change any default settings here.

Step 6 – Final Output

Ready for the final step? Just one more baby step, and we’re done!

That final step is the Render tool. This one needs the dropdown value “Insert Section Breaks Between Records” to be selected for the Separator section. Do that, and you’re home free!

That’s a Wrap

There you have it! When you have an original contract and a revised contract, Alteryx can compare the two and build a list of which (if any) pages have extra marks, notes, or doodles between the original and the new version. Better yet, it can create a report to quickly show you which pages have manual markups, saving time and potentially catching someone trying to sneak in last-minute, unapproved changes.

Here's what the final report looks like.

