Alteryx Designer Desktop Discussions

SOLVED

PDF to Text for scanned in documents

msjac01
6 - Meteoroid

Does anyone have good experience using the PDF to Text tool with low-quality scanned PDFs? We receive recurring refund checks, and I'm looking to scan the remits in and extract a few fields into a table format. The fields on the check are always in the same place, and the check formatting does not change.

10 REPLIES
alexnajm
16 - Nebula

Typically I would start with an Image Input tool to read in the files, optionally use an Image Processing tool to improve the quality, and then use the PDF to Text tool to extract the text. If the fields are always in the same place, you can also use an Image Template tool, which connects to the T anchor of the PDF to Text tool.

msjac01
6 - Meteoroid

Hi Alex,

Thanks for the information. Does the Image Template tool work for multiple pdf pages in one attachment? I was able to get a successful output w/ just one page, but it doesn't seem to recognize when there are multiple pages in the pdf. Screenshot of workflow below.

alexnajm
16 - Nebula

To apply your annotations to all pages, select "Apply First Page of Annotations in Image Template to All Pages" in the PDF to Text tool: https://help.alteryx.com/current/en/designer/tools/alteryx-intelligence-suite/computer-vision/image-...

msjac01
6 - Meteoroid

Hi Alex,

Is 'Apply First Page of Annotations in Image Template to All Pages' a feature only on 2023.2 and newer Alteryx versions? I've yet to upgrade my current version (2022.3) and I'm not seeing that option anywhere.

alexnajm
16 - Nebula

It looks like that's accurate - I would upgrade to 2023.2 (maybe 2023.1, I don't have that version on hand)

msjac01
6 - Meteoroid

We're still running the older Server version as well, so I'm hesitant to upgrade my desktop version until we upgrade the Server environment... we have had issues in the past with version differences. Any other thoughts on a workflow without using the Image Template tool?

alexnajm
16 - Nebula

You can read one of the multi-page PDFs into the Image Template tool and mark up the additional pages. I would select the PDF with the most pages (e.g. a 4-page file) so the markups can be applied to any PDFs with the same number of pages or fewer (e.g. 4 or fewer).
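
To make the "mark up the longest PDF" idea concrete, here is a minimal conceptual sketch in Python. It is only a model of the reasoning, not how Alteryx implements the Image Template tool: the `TEMPLATE` structure, the field names, the pixel bounds, and the `regions_for_pdf` helper are all hypothetical.

```python
# Hypothetical template built from the PDF with the most pages (4 here):
# page number -> list of (field_name, (left, top, right, bottom)) regions.
TEMPLATE = {
    1: [("check_number", (400, 50, 560, 80))],
    2: [("amount", (420, 300, 560, 330))],
    3: [],
    4: [("payee", (60, 120, 380, 150))],
}


def regions_for_pdf(page_count, template):
    """Map each page of a PDF to the template regions marked up for that page.

    Pages beyond what the template covers get an empty list, which is why
    the template should come from the longest PDF you expect to process.
    """
    return {page: template.get(page, []) for page in range(1, page_count + 1)}


two_page = regions_for_pdf(2, TEMPLATE)    # fully covered by the template
five_page = regions_for_pdf(5, TEMPLATE)   # page 5 has no markups at all

print(two_page)
print(five_page[5])
```

A 2-page PDF simply uses the first two pages of markups, while a 5-page PDF would have nothing to extract from its last page.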

mpeterson27
6 - Meteoroid

When using a template, there are a few things you may want to consider:

  • Even if the information is in the same place, you may not pull the data you need if the scan is slightly off.
  • The Image Template tool can behave oddly on Server, especially when using UNC paths to the template or PDF.
    • Make sure Server can access the example PDF and the JSON template.
      • Tip: You can manually edit the XML of the template tool and copy/paste the bounds from the JSON. This ends up solving a lot of issues with using the template tool on Server. It seems that something in the XML gets truncated at upload, and I have found that manually copying and pasting the bounds mitigates that.

With low-quality images, I do much of what Alex suggested, but if there is any chance of a discrepancy in placement, I usually do not use the template tool. During early development of this sort of workflow, I read in all the data (no templates) using each of the 4 output types available (lines, string, pipe-delimited table, Alteryx table), then assess the outputs to see which might be best. You will clean up the data in different ways depending on the output (a series of Filter tools and XML parsing), but I try to see which one gets closest to the format I need and then narrow the PDF to Text tool down to that output option. From there I build a workflow around cleaning up the data.

With this approach, I've found I can create more dynamic solutions that rely a bit less on perfectly scanning in docs to fit templates. The trade-off is more dev time, but it usually produces more consistent outputs that are not affected by the position of data on the PDF.
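
As a rough illustration of the template-free approach, the sketch below pulls fields out of "lines"-style OCR output by matching labels rather than positions. It is written in Python only to show the logic; inside Designer the same idea could be built with the RegEx tool. The sample text, field names, and patterns are all made up for the example, and real scans will need looser patterns and more cleanup.

```python
import re

# Hypothetical OCR output in "lines" form; real scans will be noisier.
SAMPLE_LINES = [
    "ACME INSURANCE CO   REFUND REMITTANCE",
    "Check No: 004512     Date: 03/14/2024",
    "Payee:  Example Clinic LLC",
    "Amount  $1,250.75",
]

# Illustrative patterns keyed by output column name (all assumed).
FIELD_PATTERNS = {
    "check_number": re.compile(r"Check\s*No[.:]?\s*(\d+)", re.IGNORECASE),
    "check_date": re.compile(r"Date[.:]?\s*(\d{1,2}/\d{1,2}/\d{4})", re.IGNORECASE),
    "amount": re.compile(r"Amount[.:]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_fields(lines):
    """Return one row (dict) holding the first match for each field, else None."""
    row = {name: None for name in FIELD_PATTERNS}
    for line in lines:
        for name, pattern in FIELD_PATTERNS.items():
            if row[name] is None:
                match = pattern.search(line)
                if match:
                    row[name] = match.group(1)
    return row


row = extract_fields(SAMPLE_LINES)
print(row)
```

Because the patterns key off the label text ("Check No", "Date", "Amount"), a scan that is shifted or slightly rotated still yields the same row as long as the OCR reads the labels correctly.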

alexnajm
16 - Nebula

Great summary @mpeterson27!
