Does anyone have good experience using the PDF to Text tool w/ scanned-in, low-quality PDFs? We receive some recurring refund checks, and I'm looking to scan the remits in and extract a few fields into a table format. The fields on the check are always in the same place, and the check formatting does not change.
Typically I would start with an Image Input tool to read in the files, (optionally) use an Image Processing tool to improve the quality, and finally use the PDF to Text tool to read in the PDFs. If the fields are always in the same place, you can also use an Image Template tool; its output goes into the T anchor of the PDF to Text tool.
To apply your annotations to all pages, select "Apply First Page of Annotations in Image Template to All Pages" in the PDF to Text tool: https://help.alteryx.com/current/en/designer/tools/alteryx-intelligence-suite/computer-vision/image-...
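For anyone who wants to see the idea behind a fixed-position template outside of Designer, here is a rough Python sketch. It is not how the Image Template or PDF to Text tools are implemented; it only illustrates the concept of marking field regions on the first page and reusing them on every page. The `Box` and `Word` types, the field names, and all coordinates are made up for illustration, and the OCR word positions are assumed to be available already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A rectangular field region in page coordinates (hypothetical units)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class Word:
    """One OCR'd word with the coordinates of its centre point (assumed input)."""
    text: str
    x: float
    y: float


def extract_fields(words, template):
    """Join the words whose centre falls inside each template region.

    Words are ordered left to right, which assumes single-line fields.
    """
    fields = {}
    for name, box in template.items():
        hits = [w for w in words if box.contains(w.x, w.y)]
        hits.sort(key=lambda w: w.x)
        fields[name] = " ".join(w.text for w in hits)
    return fields


def extract_all_pages(pages, first_page_template):
    """Reuse the first page's regions on every page, one output row per page."""
    return [extract_fields(words, first_page_template) for words in pages]


# Hypothetical template and OCR output for two scanned remits.
template = {
    "check_no": Box(50, 0, 100, 20),
    "amount": Box(50, 30, 100, 50),
}
pages = [
    [Word("No.", 20, 10), Word("1001", 60, 10), Word("$125.00", 70, 40)],
    [Word("1002", 60, 11), Word("$80.50", 72, 41)],
]
rows = extract_all_pages(pages, template)
```

The sketch also shows why templates are sensitive to scan placement: a word whose centre drifts outside its box is silently dropped from the field.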
It looks like that's accurate - I would upgrade to 2023.2 (maybe 2023.1; I don't have that version on hand).
We're still running the older Server version as well, so I'm hesitant to upgrade my desktop version until we upgrade the Server environment.... have had issues in the past w/ version differences. Any other thoughts on a workflow without using the Invoice Template tool?
You can read one of the multi-page PDFs into the Image Template tool and mark up the additional pages. I would select the PDF with the most pages (e.g. a 4-page file) so the markups can be applied to any PDFs that have the same number of pages or fewer (e.g. 4 or fewer).
When using a template, there are a few things you may want to consider:
With low-quality images, I do much of what Alex suggested, but if there is any chance of a discrepancy in placement, I usually do not use the template tool. During early development of these sorts of workflows, I read in all the data (no templates) using the 4 different output types available (lines, string, pipe-delimited table, Alteryx table), then assess the outputs to see which might work best. You will clean up the data in different ways depending on the output (a series of Filter tools and XML parsing), but I try to see which one gets closest to the format I need and then narrow the PDF to Text tool down to that output option. From there I build a workflow around cleaning up the data.
With this approach, I've found I can create more dynamic solutions that rely a bit less on scanning docs in perfectly to fit templates. The trade-off is more dev time, but it usually produces more consistent outputs that are not affected by the position of the data on the PDF.
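To make the template-free approach concrete, here is a small Python sketch of the clean-up step, assuming the text has already been extracted as a list of lines. In Designer this would be done with Filter and RegEx tools rather than code. The field labels ("Check No", "Date", "Amount") and the sample lines are hypothetical, since the actual remit layout isn't shown in this thread; the patterns would need adjusting to your documents.

```python
import re

# Hypothetical label patterns; each captures the value that follows its label.
PATTERNS = {
    "check_number": re.compile(r"check\s*(?:no\.?|number|#)\s*[:\-]?\s*(\d+)", re.I),
    "date": re.compile(r"date\s*[:\-]?\s*(\d{1,2}/\d{1,2}/\d{2,4})", re.I),
    "amount": re.compile(r"amount\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.I),
}


def parse_remit(lines):
    """Return one record per document, keeping the first match for each field.

    Fields are found by their label, not by where they sit on the page.
    """
    record = {name: None for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if record[name] is None:
                match = pattern.search(line)
                if match:
                    record[name] = match.group(1)
    return record


# Made-up OCR line output for one scanned remit.
sample_lines = [
    "REFUND REMITTANCE",
    "Check No: 004512   Date: 03/14/2023",
    "Payee ACME",
    "Amount  $1,250.00",
]
record = parse_remit(sample_lines)
```

Because matching is driven by labels, the same record comes out even if the lines arrive in a different order or the scan is shifted, which is the trade-off described above: more work on patterns, less dependence on placement.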
Great summary @mpeterson27 !