I'm using the Alteryx Intelligence Suite. I have 77 pages of cancelled check images and I'm trying to extract information. We need Check Payee, Check Amount, Check Date, and Check Number. The way I got it working was to capture each check separately, and each field within that check separately. The problems I foresee are:
Hi @Natalia_vf, I don't have any experience with these tools, but I do with PDF extraction in Alteryx. In the past I have used this macro created by @OllieClarke to batch read PDFs.
The output is the raw text so it then becomes a case of creating a parsing methodology which allows you to extract the information you want (RegEx tool is usually your friend here).
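To give a rough idea of what such a parsing step might look like, here is a minimal Python sketch of the same RegEx approach. The sample text, the field labels ("Check No.", "Pay to the order of", and so on) and the names `SAMPLE`, `PATTERNS` and `parse_check` are all assumptions for illustration; the real text coming out of the macro will be laid out differently and the patterns would need adjusting. The same expressions could be tried in the RegEx tool in Parse mode.

```python
import re

# Hypothetical raw text for one check; real extracted text will differ.
SAMPLE = """\
Check No. 1042
Date: 03/15/2021
Pay to the order of: Acme Supplies LLC
Amount: $1,250.00
"""

# One pattern per field, each with a single capture group (assumed labels).
PATTERNS = {
    "check_number": re.compile(r"Check\s+No\.?\s*(\d+)", re.IGNORECASE),
    "check_date": re.compile(r"Date:?\s*(\d{1,2}/\d{1,2}/\d{2,4})", re.IGNORECASE),
    "check_payee": re.compile(r"Pay\s+to\s+the\s+order\s+of:?\s*(.+)", re.IGNORECASE),
    "check_amount": re.compile(r"\$\s*([\d,]+\.\d{2})"),
}


def parse_check(text):
    """Return the first match for each field, or None when a field is missing."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1).strip() if match else None
    return result


print(parse_check(SAMPLE))
```

Returning None for a missing field makes it easy to filter out the checks that need a manual look.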
I don't expect you'll be able to share an example, but please take a look at this tool and see if it helps/makes your life easier!
https://gallery.alteryx.com/#!app/PDF-Input/5b685aff0462d710907f7a3b
Ben
I have posted in the ideas forum asking for this to be updated as a feature, so it is worth adding comments to that post to improve the chances that this becomes a native feature.
In the meantime, the workaround I have found is to add a Record ID tool so you still know which document each row is, then overwrite the page number using a Formula tool. This tricks all the inbound documents into looking like page 1, which is how the template is set up.
Somewhere between versions 2020.2 and 2021.2, this workaround stopped working.
In addition to changing all 'page' values to '1', you will need to modify the 'path' field so that every value is unique.
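For anyone who finds it easier to read as code, the updated workaround can be sketched in Python roughly as below. This only models the row-level logic of the Record ID and Formula tools; it is not the workflow itself. The 'page' and 'path' field names come from the posts above, while `RecordID`, `original_page`, `apply_workaround` and the choice to append the record ID to the path are assumptions made for the example.

```python
def apply_workaround(rows):
    """Give every row a record ID, force page to 1, and make each path unique."""
    out = []
    for record_id, row in enumerate(rows, start=1):
        new_row = dict(row)
        new_row["RecordID"] = record_id          # keeps track of the source row
        new_row["original_page"] = row["page"]   # hypothetical: remember the real page
        new_row["page"] = 1                      # the template is set up on page 1
        # Assumed way of making the path unique: append the record ID.
        new_row["path"] = f"{row['path']}_{record_id}"
        out.append(new_row)
    return out


# Hypothetical input: three pages of the same file.
pages = [{"path": "checks.pdf", "page": n} for n in (1, 2, 3)]
for row in apply_workaround(pages):
    print(row)
```

Keeping the record ID and original page alongside the altered fields means the results can still be tied back to the right document afterwards.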
This seems to work, but I am a little stumped as to why, because the path doesn't seem to be referenced. Any insight on what that formula does?
Under expected usage, the tool can have only one value per annotation name per file (e.g. you can't reuse the same annotation name even if it's on a different page of the template). My assumption is that the results of the extraction are saved back to the original table using filename and page as key fields, rather than being processed line by line. It seems that, when filename and page combinations are duplicated, only the last one is retained before being joined back to the original table.
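To illustrate that assumption (and it is only an assumption, I have not seen how the tool is implemented), here is a small Python model of a join keyed on path and page. The function name `join_results_by_key` and the sample values are made up for the example.

```python
def join_results_by_key(rows, extracted):
    """Model of the assumed behaviour: results are stored by (path, page) and joined back."""
    by_key = {}
    for row, value in zip(rows, extracted):
        # A duplicate (path, page) key overwrites the earlier result.
        by_key[(row["path"], row["page"])] = value
    return [by_key[(row["path"], row["page"])] for row in rows]


values = ["payee A", "payee B", "payee C"]

# Page forced to 1 but path left unchanged: every row shares one key.
same_path = [{"path": "checks.pdf", "page": 1} for _ in values]
print(join_results_by_key(same_path, values))

# Page forced to 1 and path made unique: every row keeps its own result.
unique_path = [{"path": f"checks.pdf_{i}", "page": 1} for i in range(1, 4)]
print(join_results_by_key(unique_path, values))
```

In this model the first call returns the last value for every row, while the second returns each row's own value, which would explain why making the path unique fixes the workaround.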