
SOLVED

Check images extraction using pdf input and image to text tools

Natalia_vf
6 - Meteoroid

Natalia_vf_0-1616700300831.png

I'm using the Alteryx Intelligence Suite. I have 77 pages of cancelled check images and I'm trying to extract information. We need Check Payee, Check Amount, Check Date, and Check Number. The way I got it working was to capture each check separately, and each field within that check separately. The problems I foresee are:

 

  1. We have 77 pages of this. I couldn’t get it to work so that a template built for the first page is applied to all the other pages. As a result, I would need to make 77 different PDFs. That can’t be the most efficient way. Should I use a macro?
  2. The OCR ends up being very wrong anyway, and I would need to manually fix most of it:
    1. Sometimes no Check Payee is captured.
    2. Check Dates are way off for almost all checks.
    3. Check Amounts and Check Numbers are either blank or complete gibberish.

I appreciate any help!
5 REPLIES
BenMoss
ACE Emeritus

Hi @Natalia_vf, I don't have any experience with these tools, but I do with PDF extraction in Alteryx. In the past I have used this macro, created by @OllieClarke, to batch read PDFs.

 

The output is the raw text, so it then becomes a case of creating a parsing methodology that lets you extract the information you want (the RegEx tool is usually your friend here).
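To illustrate the kind of parsing this implies, here is a minimal Python sketch rather than an Alteryx workflow. The sample text, the field labels in it, and the patterns are all assumptions made up for illustration; real check text will be laid out differently and the patterns would need adjusting.

```python
import re

# Hypothetical raw text for one check page; real extracted text will differ.
SAMPLE_TEXT = """
Check No. 1042
Date: 03/25/2021
PAY TO THE ORDER OF Jane Doe
$1,234.56
"""

# Assumed patterns, one per field the original post asks for.
PATTERNS = {
    "check_number": re.compile(r"Check\s+No\.?\s*(\d+)", re.IGNORECASE),
    "check_date": re.compile(r"Date:?\s*(\d{1,2}/\d{1,2}/\d{4})", re.IGNORECASE),
    "check_payee": re.compile(r"PAY TO THE ORDER OF\s+(.+)", re.IGNORECASE),
    "check_amount": re.compile(r"\$\s*([\d,]+\.\d{2})"),
}


def parse_check(text):
    """Return a dict of field name -> captured value, or None when not found."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1).strip() if match else None
    return result


parsed = parse_check(SAMPLE_TEXT)
```

Returning None for a missing field, instead of raising, makes it easy to filter out the pages that still need manual review.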

 

I don't expect you'll be able to share an example, but please take a look at this tool and see if it helps/makes your life easier!

 

https://gallery.alteryx.com/#!app/PDF-Input/5b685aff0462d710907f7a3b

 

Ben

cgoodman3
14 - Magnetar

@Natalia_vf 

 

I have posted in the Ideas forum asking for this to be added as a feature, so it would be worth commenting on that post to improve the chances that it becomes a native feature.

 

In the meantime, the workaround I have found is to add a Record ID tool so you still know which document each row came from, then update the page number using a Formula tool. This tricks all the inbound documents into looking like page 1, which is the page the template is set up for.

 

cgoodman3_0-1616749840397.png
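For readers who think in code, the same idea can be sketched in Python. This is only a model of the data manipulation, not the Alteryx workflow itself: the `page` field name comes from this thread, while `record_id` and `original_page` are hypothetical names chosen for illustration.

```python
def reset_pages(records):
    """Tag each row with a record ID, then relabel every row as page 1."""
    out = []
    for record_id, rec in enumerate(records, start=1):
        new = dict(rec)                    # copy so the input rows are not mutated
        new["record_id"] = record_id       # plays the role of the Record ID tool
        new["original_page"] = rec["page"] # keep the real page number for reference
        new["page"] = 1                    # plays the role of the Formula tool
        out.append(new)
    return out


# Hypothetical input: one row per page of a three-page PDF.
pages = [{"path": "checks.pdf", "page": n} for n in (1, 2, 3)]
relabelled = reset_pages(pages)
```

Keeping the record ID alongside the overwritten page number is what lets the results be traced back to the right page afterwards.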

Chris
Paul-Evans
9 - Comet

Somewhere between versions 2020.2 and 2021.2, this workaround stopped working.

In addition to changing all 'page' values to '1', you will need to modify the 'path' field so that every value is unique.

 

PaulEvans_0-1630151924877.png
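As a rough Python model of that change (the exact formula in the screenshot is not reproduced here): appending a running number to the path is just one assumed way to make each value unique, and `original_path` is a hypothetical field name.

```python
def make_rows_unique(records):
    """Set every page to 1 and give each row its own distinct path value."""
    out = []
    for i, rec in enumerate(records, start=1):
        new = dict(rec)
        new["original_path"] = rec["path"]  # keep the real file path for reference
        new["path"] = f"{rec['path']}_{i}"  # assumed scheme: suffix a running number
        new["page"] = 1
        out.append(new)
    return out


# Hypothetical input: three pages of the same PDF.
rows = [{"path": "checks.pdf", "page": n} for n in (1, 2, 3)]
unique_rows = make_rows_unique(rows)
```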

 

 

trettelap
8 - Asteroid

This seems to work, but I am a little stumped as to why, because the path doesn't seem to be referenced. Any insight on what that formula does?

Paul-Evans
9 - Comet

Under expected usage, the tool can have only one value per annotation name per file (e.g., you can't reuse an annotation name even on a different page of the template). My assumption is that the results of the extraction are saved back to the original table using filename and page as key fields, rather than being processed line by line. It seems that, when filename and page combinations are duplicated, only the last one is retained before being joined back to the original table.
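That assumed behavior can be modeled in a few lines of Python. This is a sketch of the hypothesis only, not the tool's confirmed internals; the `text` field and the sample values are made up for illustration.

```python
def join_back(rows, results):
    """Attach extracted text to rows using (path, page) as the join key."""
    latest = {}
    for res in results:
        # A repeated key overwrites the earlier entry, so only the last survives.
        latest[(res["path"], res["page"])] = res["text"]
    return [dict(row, text=latest.get((row["path"], row["page"]))) for row in rows]


OCR_TEXTS = ["check A", "check B", "check C"]  # hypothetical per-page results


def extract(rows):
    """Pair each row with its own extraction result, in row order."""
    return [dict(row, text=t) for row, t in zip(rows, OCR_TEXTS)]


# Three pages relabelled as page 1 of the same file: the keys collide.
same_path = [{"path": "checks.pdf", "page": 1} for _ in range(3)]
collapsed = join_back(same_path, extract(same_path))

# The same pages with unique paths: every key is distinct.
unique_path = [{"path": f"checks.pdf_{i}", "page": 1} for i in range(1, 4)]
distinct = join_back(unique_path, extract(unique_path))
```

With colliding keys every row ends up with "check C", the last result written; with unique paths each row keeps its own text, which would explain why changing a field that looks unused still fixes the output.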
