Alteryx Designer Desktop Discussions

msjac01 · ‎04-08-2024

Does anyone have good experience using the PDF to Text tool w/ scanned-in, low-quality PDFs? We receive some recurring refund checks, and I'm looking to scan the remits in and extract a few fields into a table format. The fields on the check are always in the same place and the check formatting does not change.

alexnajm · ‎04-09-2024

Typically I would start with an Image Input tool to read in the files, (optionally) use an Image Processing tool to improve the quality, then finally use the PDF to Text tool to read in the PDFs. You can also optionally use an Image Template tool if the fields are always in the same place - that would go into the T anchor of the PDF to Text tool.

msjac01 · ‎04-09-2024

Hi Alex,

Thanks for the information. Does the Image Template tool work for multiple pdf pages in one attachment? I was able to get a successful output w/ just one page, but it doesn't seem to recognize when there are multiple pages in the pdf. Screenshot of workflow below.

alexnajm · ‎04-09-2024

To apply your annotations to all pages, select "Apply First Page of Annotations in Image Template to All Pages" in the PDF to Text tool: https://help.alteryx.com/current/en/designer/tools/alteryx-intelligence-suite/computer-vision/image-...

msjac01 · ‎04-09-2024

Hi Alex,

Is 'Apply First Page of Annotations in Image Template to All Pages' a feature only on 2023.2 and newer Alteryx versions? I've yet to upgrade my current version (2022.3) and I'm not seeing that option anywhere.

alexnajm · ‎04-10-2024

It looks like that's accurate - I would upgrade to 2023.2 (maybe 2023.1, I don't have that version on hand)

msjac01 · ‎04-10-2024

We're still running the older Server version as well, so I'm hesitant to upgrade my desktop version until we upgrade the Server environment... we've had issues in the past w/ version differences. Any other thoughts on a workflow without using the Image Template tool?

alexnajm · ‎04-10-2024

You can read one of the multi-page PDFs into the Image Template tool and mark up the additional pages - I would select the PDF with the most pages (e.g. a 4-page file) so it can apply the markups to any PDFs that have the same number of pages or fewer (i.e. 4 or fewer)

mpeterson27 · ‎04-10-2024

When using a template, there are a few things you may want to consider:

- Even if the information is in the same place, if the scan is slightly off, you may not pull the data you need.
- The Image Template tool gets a little odd on Server sometimes, especially when using UNC paths to the template or PDF.
  - Make sure Server can access the example PDF and the JSON template.
  - Tip: You can manually edit the XML of the template tool and copy/paste the bounds from a JSON. This ends up solving a lot of issues with using the template tool on Server. It seems that something in the XML gets truncated at upload, and I have found that manually copying/pasting the bounds mitigates that.
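
One way to confirm the truncation described above before pasting anything by hand is to compare the template JSON you saved locally with the text that ended up in the tool's XML. The following is only a rough Python sketch of that comparison: the sample JSON keys (`page`, `bounds`) are made up for illustration and are not the actual Image Template schema, and how you pull the embedded text out of the XML is left to you.

```python
import json


def is_valid_json(text):
    """Return True when `text` parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def looks_truncated(original, embedded):
    """Return True when `embedded` is a cut-off copy of `original`.

    "Cut off" here means: it differs from the original, the original
    starts with it, and it no longer parses as JSON.
    """
    original = original.strip()
    embedded = embedded.strip()
    return (
        embedded != original
        and original.startswith(embedded)
        and not is_valid_json(embedded)
    )


# Hypothetical template content; the real file will look different.
template_json = '{"page": 1, "bounds": [10, 20, 200, 60]}'

# Simulate a copy that was cut short during upload.
cut_off_copy = template_json[:25]

print(looks_truncated(template_json, cut_off_copy))   # cut-off copy
print(looks_truncated(template_json, template_json))  # intact copy
```

If the check flags the embedded copy, pasting the bounds back in from the original JSON (as suggested above) is the fix to try.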

With low-quality images, I do much of what Alex suggested, but if there is any chance of a discrepancy in placement, I usually do not use the template tool. During early development of these sorts of workflows, I read in all data (no templates) using the 4 different output types available (lines, string, pipe-delimited table, Alteryx table), then assess the outputs to see which might be best. You will clean up the data in different ways depending on the output (a series of Filter tools and XML parsing), but I try to see which gets closest to the format I need and then narrow down the PDF to Text tool to that output option. From there I build a workflow around cleaning up the data.

With this approach, I've found I can create more dynamic solutions that rely a bit less on perfectly scanning in docs to fit templates. The trade-off is more dev time, but it usually produces more consistent outputs that are not affected by the position of data on the PDF.
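
To illustrate the template-free idea, here is a minimal Python sketch of pulling fields out of "lines"-style output by matching on label text instead of page position. Inside Designer you would more likely do this with RegEx and Filter tools; the sample lines, field names, and patterns below are all invented for the example and would need to be adapted to what your remits actually contain.

```python
import re

# Hypothetical OCR output in "lines" form; labels and values are made up.
SAMPLE_LINES = [
    "REFUND REMITTANCE",
    "Check No: 00123456",
    "Date  04/08/2024",
    "Amount: $1,250.00",
]

# Hypothetical patterns keyed on the label text rather than on page position.
# \W* skips stray punctuation or spacing that OCR may put after a label.
FIELD_PATTERNS = {
    "check_number": re.compile(r"check\s*(?:no|number|#)\W*(\d+)", re.IGNORECASE),
    "date": re.compile(r"date\W*(\d{1,2}/\d{1,2}/\d{4})", re.IGNORECASE),
    "amount": re.compile(r"amount\W*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_fields(lines):
    """Return the first match for each field, or None when no line matches."""
    result = {name: None for name in FIELD_PATTERNS}
    for line in lines:
        for name, pattern in FIELD_PATTERNS.items():
            if result[name] is None:
                match = pattern.search(line)
                if match:
                    result[name] = match.group(1)
    return result


fields = extract_fields(SAMPLE_LINES)
print(fields)
```

Because each pattern looks for the label wherever it appears, a scan that is shifted or slightly rotated should still produce the same fields, as long as the OCR reads the label and value on the same line.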

alexnajm · ‎04-10-2024

Great summary @mpeterson27 !

PDF to Text for scanned in documents