Alteryx Designer Ideas


Image to text - Option to apply template to all pages

When using the text mining tools, I have found that a template is only applied to the document page whose page number matches the page the template was built on.

 

In my use case I have a PDF file with 100+ claim statements, all laid out the same way (one page per statement). To set up the template I annotated one page, then fed the template into the T anchor of the Image to Text tool. My 100+ page PDF document goes into the D anchor of the same tool. However, when examining the output I only get results for page 1.

 

On examining the JSON for the template I can see that there is a reference to the template page number:

cgoodman3_0-1604393391514.png

 

Playing around with a Generate Rows tool and a Formula tool to replace the page number in the JSON with pages 1 - 100 doesn't work. I then discovered that if I change the page number on the image input side then I get the desired results.
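
The page-number change described above can be sketched outside Alteryx. This is only an illustration in Python, not the workflow itself: the record layout and the field names `page` and `original_page` are assumptions, and it presumes the template was annotated on page 1.

```python
from typing import Any, Dict, List


def align_pages_to_template(
    records: List[Dict[str, Any]],
    template_page: int = 1,
    page_field: str = "page",
) -> List[Dict[str, Any]]:
    """Return copies of the records with the page number set to the template's page.

    The real page number is kept in 'original_page' (a hypothetical field name)
    so that extracted results can still be traced back to their source page.
    """
    aligned = []
    for record in records:
        updated = dict(record)  # copy so the input records are not mutated
        updated["original_page"] = record[page_field]
        updated[page_field] = template_page
        aligned.append(updated)
    return aligned


# Hypothetical input: one record per PDF page.
pages = [{"page": n, "image": f"page-{n}"} for n in range(1, 4)]
for row in align_pages_to_template(pages):
    print(row)
```

Keeping the original page number in a separate field matters because, once every record claims to be page 1, there is otherwise no way to tell which statement each result came from.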

 

cgoodman3_1-1604393499357.png

However, since I suspect this is a common use case for the Image to Text tool, an improvement would be to add an option in the tool's configuration to apply the same template to all pages.

 

cgoodman3_4-1604393738275.png

12 Comments
Kenda
15 - Aurora

Love this idea! I ran into a similar issue before

Dynamomo
11 - Bolide

I think this is the same request as my product idea - https://community.alteryx.com/t5/Alteryx-Designer-Ideas/Apply-formatting-defined-in-Image-Template-t...

Can we combine?

cgoodman3
13 - Pulsar

Yes, it looks like it covers the same request. I can’t combine them, so we’ll need Alteryx to do that. @KylieF, can you do this?

mpressive6
5 - Atom

I'm hoping that someone can provide me with some assistance. I'm relatively new to the Alteryx platform and came across this thread, which is similar to a business challenge that I'm attempting to solve. I need to search multiple PDFs (the entire document in each case) for certain key terms. I worked with my IT org to get a Trial License, which gives me access to the Text Mining module. I've been able to develop a workflow that gets me close to a working solution, but I need to be able to search the entire document, not just one page, which still appears to be a limitation of the Image Template tool.

 

Chris below seems to have come up with an elegant workaround which allowed him to "change the page number on the image input side". Can anyone share an example workflow using the formula tool below on how this would be done? 

 

 

On examining the JSON for the template I can see that there is reference to the template page number:

mpressive6_0-1619785452459.png

 

 

And playing around with a generate rows tool and formula to replace the page number with pages 1 - 100 in the JSON doesn't work. I then discovered that if I change the page number on the image input side then I get the desired results. 

 

mpressive6_1-1619785452446.png

 

 

cgoodman3
13 - Pulsar

If you want to bring in the whole document, you don’t need the template tool. The T anchor on the image to text tool is optional.

mpressive6
5 - Atom

I guess I'm not understanding your proposal (T anchor - optional). I would like to be able to annotate the entire document (all pages) using the Image Template tool, or some other method, to look for a specific term. I was able to get a basic workflow (proof of concept) working with a document whose contents I knew ahead of time. Using the Image Template tool, I annotated the section of that document that included the terms I was looking for.

 

What I would like is a workflow that will annotate an entire document (looking for the key terms on all pages), because the term(s) may appear in different document sections across my population (>1000 PDFs). Each PDF may have a different total page count, so just annotating a particular section or page number will probably miss a lot of information.
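
As a rough illustration of the search described above, assuming the text of each page has already been extracted (for example by running Image to Text without a template), a case-insensitive key-term search over every page might look like the Python sketch below. The function name `find_terms` and the sample text are hypothetical.

```python
import re
from typing import List, Tuple


def find_terms(pages: List[str], terms: List[str]) -> List[Tuple[int, str]]:
    """Return (page_number, term) pairs for every term found on every page.

    Page numbers start at 1. Matching is case-insensitive and literal:
    re.escape ensures the terms are not interpreted as regular expressions.
    """
    hits = []
    for page_number, text in enumerate(pages, start=1):
        for term in terms:
            if re.search(re.escape(term), text, flags=re.IGNORECASE):
                hits.append((page_number, term))
    return hits


# Hypothetical extracted text, one string per page.
sample_pages = ["Claim total due", "nothing relevant here", "CLAIM denied"]
print(find_terms(sample_pages, ["claim", "denied"]))
```

Because the loop runs over whatever pages are supplied, documents with different page counts need no special handling.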

 

I believe your proposed solution at the beginning of the thread would work for my business case. There, it appears you used the Formula tool to change the page numbers going into the Image to Text tool so that the template applied to all pages. Specifically, you mentioned the following: "I then discovered that if I change the page number on the image input side then I get the desired results"

 

I was hoping that you could share a workflow (with the data genericized, of course) that demonstrates the step quoted above. Hopefully this clarifies what I'm trying to accomplish. Thank you in advance.

bensilv
Alteryx

Hi @cgoodman3 

 

I have tried your method, and unless the latest IS package prevents this workaround from working, I cannot get it to function.

 

When I change page to 1, the image to text tool only extracts the same page each time.

 

In my example, I have a 7-page PDF, which I also use as the template. I have marked up just page 1 of the template, then attempted to apply that markup to all 7 pages. When I set page = 1, strangely it only extracts data for page 7 (for each of the 7 records), giving me 7 duplicate extractions.

 

Any ideas?

cgoodman3
13 - Pulsar

@bensilv Is this 2021.2? Let me see if I can replicate it. The version it worked in was 2020.2, but there might have been underlying changes in the code base due to the table extraction etc.

bensilv
Alteryx

@cgoodman3 Yes, this is with 2021.2, where the image tools are now in the "Computer Vision" category. I ran through this with another Alteryx user and the result was the same.

 

Strangely, with my 7-page PDF, when I use the page=1 logic, it duplicates page 7 each time.

Paul-Evans
9 - Comet

@cgoodman3 & @bensilv Have you figured out an update to the workaround? I'm seeing the same duplication of the last page with 2021.3.