KnowledgeSpark
Alteryx Alumni (Retired)

In Compare PDF Files Using Computer Vision: Uncover Existence of Handwritten Notes, Part 1, Steps 1 through 4 compared a signed contract against the original contract and flagged two of its six pages as having handwritten markups.

 

Part 2 takes that list and makes it pretty! More specifically, I combined Computer Vision with Reporting tools to display the original and marked-up pages side by side in a new PDF file.

 

Because the list of pages with markups comes from a formula that tests the ratio of bright pixel counts between the original and new documents, my idea was to treat the flagged pages as *potential* markup pages until a human could review them. That’s why the end product is a collection of side-by-side pages.
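
For readers who think in code, here is a rough Python sketch of what a bright-pixel ratio test could look like. It is not the Part 1 workflow itself: the brightness cutoff of 200, the 2% tolerance, and the function names are all assumptions made for illustration, and each page is modeled as a plain list of grayscale values from 0 to 255.

```python
# Hypothetical sketch of a bright-pixel ratio test. The cutoff and the
# tolerance are assumed values, not the ones used in Part 1.

def count_bright(pixels, cutoff=200):
    """Count grayscale pixels (0-255) at or above the assumed cutoff."""
    return sum(1 for row in pixels for value in row if value >= cutoff)


def is_potential_markup(original, revised, tolerance=0.02):
    """Flag a page when its bright-pixel count drifts from the original's."""
    original_bright = count_bright(original)
    revised_bright = count_bright(revised)
    if original_bright == 0:
        # Nothing to compare against, so any bright pixel counts as a change.
        return revised_bright != 0
    ratio = revised_bright / original_bright
    return abs(ratio - 1.0) > tolerance


# A tiny 3x4 "page": white paper (255) with one dark printed pixel (0).
original_page = [
    [255, 255, 255, 255],
    [255, 0, 255, 255],
    [255, 255, 255, 255],
]
# The same page after someone inks over two more pixels.
signed_page = [
    [255, 255, 255, 255],
    [255, 0, 0, 0],
    [255, 255, 255, 255],
]

print(is_potential_markup(original_page, original_page))  # False
print(is_potential_markup(original_page, signed_page))    # True
```

A test like this only says that the pixel counts differ, not why, which is the reason a human still gets the final say.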

 

Step 5 – Build Report

 

This collection of tools is all about building a report that can be sent to a PDF file.

 

image001.jpg

 

The first part of Step 5 focuses on processing the images from the original and signed contracts in parallel. Those get joined together, sorted by page number, and then processed as a header and a table, which in turn get stacked together. Let’s break it down.

 

Image Processing – Shrinking to 25% of the Original Size

 

Remember that our final output displays each marked-up page of the unsigned and signed contracts side by side, with the whole page visible. The way we accomplish that is by shrinking the pages to a size that is small enough to fit on the screen (or a letter-sized piece of paper) yet large enough to make visual inspection easy.

 

I mentally numbered the next three tools dealing with File 1 as numbers 1, 2, and 3. Here’s a screenshot to show you what I mean.

 

KnowledgeSpark_1-1672159425835.png

 

Tool 1 in this set is Image Processing. The data before it runs looks like this. I will be working with the image in the field [File 1 Original Image].

 

image003.jpg

 

With the Image Processing tool (Tool 1) selected, I’ll choose the field I care about from the dropdown list. Then I will add the Scale step and set it to 25 as the percentage of the Width. With the checkbox checked for Lock Aspect Ratio, the Height field will also be set to 25% of the original.

 

KnowledgeSpark_3-1672159425854.png

 

Once this tool runs, it creates a new field named [File 1 Original Image_processed].
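
If it helps to see the arithmetic, here is a small Python sketch of a percentage scale with a locked aspect ratio. The 1700 x 2200 pixel page (letter paper at 200 DPI) and both function names are assumptions for the example; this only mimics the math, not the Image Processing tool itself.

```python
# Sketch of a percentage scale with a locked aspect ratio.
# The 1700 x 2200 page size is an assumed example (letter paper at 200 DPI).

def scale_locked(width, height, percent):
    """Scale both dimensions by the same percentage, rounded to whole pixels."""
    factor = percent / 100
    return max(1, round(width * factor)), max(1, round(height * factor))


def processed_name(field_name):
    """Mirror the naming pattern seen above: the source name plus '_processed'."""
    return f"{field_name}_processed"


new_width, new_height = scale_locked(1700, 2200, 25)
print(new_width, new_height)                    # 425 550
print(processed_name("File 1 Original Image"))  # File 1 Original Image_processed
```

Because both sides shrink to 25%, the scaled page holds one sixteenth of the original pixels, which is why two of them fit comfortably side by side.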

 

Next, Tool 2 uses the Image tool from the Reporting tab.

 

KnowledgeSpark_4-1672159425859.png

 

To configure this tool, I click the radio button next to “Get Image From Binary Data In Field” and then select the new [File 1 Original Image_processed] field.

 

image005.jpg

 

At this point, I’ve got the original image shrunk to 25%, stored as a new field, and that new field has been processed as a binary image. The Image tool ignores the name of whatever field it was given and names its output [Image]. It’s like that one great aunt who calls every male person “Carl,” regardless of that person’s actual name. (You don’t have an aunt like that? Lucky you.)

 

Tool 3 is just a Select tool that drops most of the fields and renames the [Image] field (or “Carl,” as we affectionately call it) to something more descriptive, like “File 1 Image.”

 

KnowledgeSpark_6-1672159425884.png

 

That’s one set of three tools, all of which have been focused on File 1.

 

Then there’s File 2. That gets its own set of three tools, which do all the same things, only pointed at [File 2 Original Image] instead of [File 1 Original Image].
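
In code terms, the two sets of three tools behave like one function called twice with a different field name. Here is a hedged Python sketch of that idea. Images are stood in for by (width, height) tuples, the sample sizes and the "Potential markup" value are made up, and the choice to keep [Page] and [Test for Markups] alongside the renamed image is my assumption about what the Select tool retains.

```python
# Sketch: one reusable "three-tool chain" applied to two different fields.
# Images are represented by (width, height) tuples; all values are made up.

def prepare_stream(records, source_field, output_field, percent=25):
    """Scale the image in source_field, then keep a few fields and rename it."""
    prepared = []
    for record in records:
        width, height = record[source_field]
        scaled = (round(width * percent / 100), round(height * percent / 100))
        prepared.append({
            "Page": record["Page"],
            "Test for Markups": record["Test for Markups"],
            output_field: scaled,  # the descriptive name replaces "Image"
        })
    return prepared


records = [
    {"Page": 3, "Test for Markups": "Potential markup",
     "File 1 Original Image": (1700, 2200), "File 2 Original Image": (1700, 2200)},
    {"Page": 4, "Test for Markups": "Potential markup",
     "File 1 Original Image": (1700, 2200), "File 2 Original Image": (1700, 2200)},
]

file_1 = prepare_stream(records, "File 1 Original Image", "File 1 Image")
file_2 = prepare_stream(records, "File 2 Original Image", "File 2 Image")
print(file_1[0])
print(file_2[0])
```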

 

KnowledgeSpark_7-1672159425892.png

 

Then I join the two streams of processed images back together on the [Test for Markups] field.

 

KnowledgeSpark_8-1672159425916.png

 

To be tidy, I drop the [Test for Markups] and [Page] fields from the right input.

 

KnowledgeSpark_9-1672159425931.png

 

Now I add a Sort tool and set it to sort by the [Page] field in ascending order. In this example, that makes sure that page 3 appears before page 4.
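
Here is a minimal Python sketch of a join followed by a sort, for anyone who wants to see the shape of the data. One loud assumption: the sketch keys the join on the page number, because that value is unique per row in this made-up data. That is a choice for the example, not a description of the Join tool configuration in the screenshot. The image values are placeholder strings.

```python
# Sketch of an inner join followed by an ascending sort.
# Assumption: the join key is "Page", picked because it is unique per row in
# this sample data; it is not necessarily the key used in the workflow.

def inner_join(left, right, key, drop_from_right=()):
    """Pair rows whose key values match, skipping unwanted right-side fields."""
    joined = []
    for left_row in left:
        for right_row in right:
            if left_row[key] == right_row[key]:
                extras = {
                    name: value
                    for name, value in right_row.items()
                    if name != key and name not in drop_from_right
                }
                joined.append({**left_row, **extras})
    return joined


file_1 = [
    {"Page": 4, "Test for Markups": "Potential markup", "File 1 Image": "f1_p4"},
    {"Page": 3, "Test for Markups": "Potential markup", "File 1 Image": "f1_p3"},
]
file_2 = [
    {"Page": 3, "Test for Markups": "Potential markup", "File 2 Image": "f2_p3"},
    {"Page": 4, "Test for Markups": "Potential markup", "File 2 Image": "f2_p4"},
]

joined = inner_join(file_1, file_2, key="Page",
                    drop_from_right=("Test for Markups",))
ordered = sorted(joined, key=lambda row: row["Page"])

for row in ordered:
    print(row["Page"], row["File 1 Image"], row["File 2 Image"])
```

Dropping the duplicate fields from the right side keeps each joined row down to one copy of [Page] and [Test for Markups].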

 

Moving along, I create a header for the report.

 

KnowledgeSpark_10-1672159425940.png

 

Now I add a Select tool and then drop every field except for [File 1] and [File 2], which were just added in the steps immediately prior.

 

KnowledgeSpark_11-1672159425947.png

 

The data will include one record for every page.

 

KnowledgeSpark_12-1672159425949.png

 

In this example, “every page” means just two records. I only need one record, so I use the Select Records tool to keep only the first row.

 

KnowledgeSpark_13-1672159425952.png

 

The final step in creating a header that updates automatically as the underlying data changes is to pull in the Report Text tool from the Reporting tab. I select the radio button next to “Create new field for this text” and enter “Report Header” as the Field Name value. Then I type the text for the header in the section below, with brackets wrapping the field names. Notice that you can adjust font sizes and add basic formatting like bold, italics, and underlining. Notice, too, that a field name gets put within brackets with a colon and a capital “A,” and the brackets are enclosed within quotes.
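
The same idea (keep one record, then merge its fields into a line of text) can be sketched in a few lines of Python. The header wording and the file names below are hypothetical, and the {placeholder} syntax is Python's own rather than the Report Text tool's.

```python
# Sketch of a header that fills itself in from the first record.
# The template wording and file names are hypothetical, and the {placeholder}
# syntax belongs to Python, not to the Report Text tool.

def build_header(records, template):
    """Keep only the first record and merge its fields into the template."""
    first = records[0]
    return template.format(file_1=first["File 1"], file_2=first["File 2"])


records = [
    {"File 1": "original_contract.pdf", "File 2": "signed_contract.pdf"},
    {"File 1": "original_contract.pdf", "File 2": "signed_contract.pdf"},
]
template = "Potential markups: {file_1} compared with {file_2}"

report_header = build_header(records, template)
print(report_header)
```

Since the text comes from the data rather than being typed in by hand, pointing the workflow at a different pair of files changes the header with no extra edits.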

 

KnowledgeSpark_14-1672159425972.png

 

The text here gets rendered like this when the report runs.

 

KnowledgeSpark_15-1672159426038.png

 

These three tools created that. Useful, right?

 

KnowledgeSpark_16-1672159426042.png

 

Now I add the detail that will appear below the header.

 

KnowledgeSpark_17-1672159426052.png

 

This part simply makes use of the Table tool from the Reporting tab.

 

The top part of the Configuration is set to use the Basic table mode and to group by Page.
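
Grouping by Page means one small table per page number. A rough Python equivalent, using placeholder strings for the images, might look like the sketch below. Note that itertools.groupby only merges adjacent rows, so the rows are assumed to be sorted by Page already.

```python
from itertools import groupby

# Sketch of "group by Page": one small table per page number.
# groupby only merges neighboring rows, so the input is assumed to be
# sorted by Page. Image values are placeholder strings.

def tables_by_page(rows):
    """Return {page: [rows for that page]} from rows sorted by Page."""
    return {
        page: list(group)
        for page, group in groupby(rows, key=lambda row: row["Page"])
    }


rows = [
    {"Page": 3, "File 1 Image": "f1_p3", "File 2 Image": "f2_p3"},
    {"Page": 4, "File 1 Image": "f1_p4", "File 2 Image": "f2_p4"},
]

tables = tables_by_page(rows)
for page, table in tables.items():
    print(page, table)
```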

 

KnowledgeSpark_18-1672159426070.png

 

You can configure your reporting settings however you like, of course. However, if you want to duplicate my results, here is what I did with the column configurations.

 

The column configurations in the bottom part rename the Page field to “Page #.”

 

KnowledgeSpark_19-1672159426086.png

 

The [Page] field:

  • Rename Field was set to Page # (renamed from Page)
  • Alignment was changed from Right to Center
  • Column Rules were set up (click the Create button to create them; it then becomes the Edit button)
    • The Font Size was changed from the default of 8 to 10

 

KnowledgeSpark_20-1672159426095.png

 

The [Test for Markups] field was de-selected.

 

That’s it! All the other fields were left with default settings.

 

Next up, the report’s header and table were combined.

 

KnowledgeSpark_21-1672159426116.png

 

I used the Union tool with the header as the first connection and the table as the second connection. Normally, I can leave a Union tool with its default configuration of Auto Config by Name, but this time it was important to change that to Manually Configure Fields. I arranged things so that [Table] from stream #2 fell underneath [Report Header] from stream #1. [File 1] and [File 2] stood alone, and so did the [Page] field.
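
To show why the manual configuration matters, here is a small Python sketch of stacking two streams whose fields have different names into a single column. The output column name "Layout" is a hypothetical choice for this example, the snippets are placeholder strings, and the fields that stand alone ([File 1], [File 2], and [Page]) are left out to keep it short.

```python
# Sketch of a manually configured union: fields with different names are
# lined up into a single output column. "Layout" is a hypothetical column
# name chosen for this example.

def manual_union(streams, column_map, output_column):
    """Stack streams in order, reading each stream's mapped field into one column."""
    stacked = []
    for stream, source_field in zip(streams, column_map):
        for row in stream:
            stacked.append({output_column: row[source_field]})
    return stacked


header_stream = [{"Report Header": "<header snippet>"}]
table_stream = [
    {"Table": "<table for page 3>"},
    {"Table": "<table for page 4>"},
]

stacked = manual_union(
    [header_stream, table_stream],
    column_map=["Report Header", "Table"],
    output_column="Layout",
)
for row in stacked:
    print(row["Layout"])
```

Matching by name alone would leave the header and the tables in two separate columns, since "Report Header" and "Table" are different names.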

 

KnowledgeSpark_22-1672159426120.png

 

Next (and second to last), the Layout tool followed the Union tool. There was no need to change any default settings here.

 

Step 6 – Final Output

 

Ready for the final step? Just one more baby step, and we’re done!

 

That final step is the Render tool. This one needs the dropdown value “Insert Section Breaks Between Records” to be selected for the Separator section. Do that, and you’re home free!
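
As a loose analogy, inserting a break between records works like joining pieces of text with a separator. In the Python sketch below, a form feed character stands in for a section break and the snippets are placeholder strings. Both are assumptions for illustration, and a real PDF renderer does far more than this.

```python
# Sketch of rendering with a break between records. A form feed character
# stands in for a section break; a real PDF renderer works differently.

SECTION_BREAK = "\f"


def render(records, separator=SECTION_BREAK):
    """Join record snippets so each one after the first starts a new section."""
    return separator.join(records)


document = render(["<header snippet>", "<page 3 table>", "<page 4 table>"])
sections = document.split(SECTION_BREAK)
print(len(sections))  # 3
```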

 

KnowledgeSpark_23-1672159426136.png

 

That’s a Wrap

 

There you have it! When you have an original contract and a revised contract, Alteryx can compare the two and build a list of which (if any) pages have extra marks, notes, or doodles between the original and the new version. Better yet, it can create a report to quickly show you which pages have manual markups, saving time and potentially catching someone trying to sneak in last-minute, unapproved changes.

 

Here's what the final report looks like.

 

KnowledgeSpark_24-1672159426157.png

 

KnowledgeSpark_25-1672159426428.png

 

KnowledgeSpark_26-1672159426705.png