Alteryx Designer Desktop Discussions

PDF to Text -- How to ignore the first part of page and apply template to all pages

hellyars
13 - Pulsar

I have PDF reports that are 2-N pages in length.

I use the PDF to Text tool to import the text into Alteryx.

Unfortunately, the source PDF is typically a poor quality scan, but I can work with it.

 

The PDF to Text tool is configured to Read Text Content Only and to output as Lines. I found that applying Text Recognition in Adobe produces better results than the Read Text and Image Content setting in the Alteryx tool.

 

Here is my problem.

I want to ignore the top 1/5 of each page.

I tried using an Image Template Tool connected to the T input anchor.

I created a template from the first page of the document only. In the Image Template Tool, I highlighted the lower 4/5 of the page and assigned it the value Body_Text. That did not work: with the template attached to the PDF to Text tool, the workflow processes only the first page of the document, not all 14.
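
To make the target region concrete, here is a minimal Python sketch of the box in question. It assumes a top-left origin with coordinates in page units, and `body_region` is a hypothetical helper name used only for illustration. It is not the Image Template Tool's own format.

```python
def body_region(page_width, page_height, skip_fraction=0.2):
    """Return (left, top, right, bottom) for the area below the skipped top strip.

    Assumes the origin is the top-left corner of the page.
    """
    if not 0 <= skip_fraction < 1:
        raise ValueError("skip_fraction must be in [0, 1)")
    top = page_height * skip_fraction  # everything above this line is ignored
    return (0, top, page_width, page_height)


# Example: skip the top fifth of a 612 x 792 point page (US Letter at 72 dpi).
print(body_region(612, 792))
```

The same box would apply to every page, since the layout of each report is fixed and only the content changes.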

 

A few other details...

While the structure of each report remains the same, the content varies considerably (hence I can only really call the lower 4/5 of each page 'body_text' ).

 

 

How

 

 

 

5 REPLIES
PhilipMannering
16 - Nebula

Hi @hellyars 

 

You drew a box on the first page of the template, so it returned text for the first page only. I agree that there should be an option to return text for every page. I've attempted to create a macro that generates markup for every page listed in the "List of Pages" Text Input Tool. Please see attached and let me know if it works!

PhilipMannering_0-1681119006214.png
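
As a rough illustration of the idea (not the macro's actual markup, which isn't shown in this thread), the approach amounts to repeating one region definition for each page number. The dictionary layout and the name `replicate_region` below are assumptions made for the sketch.

```python
def replicate_region(field_name, box, pages):
    """Build one region entry per page, reusing the same box on each.

    The entry layout is hypothetical; it only illustrates the
    "same box, every page" idea.
    """
    return [{"page": page, "field": field_name, "box": box} for page in pages]


# One Body_Text region for each page of a 14-page document.
regions = replicate_region("Body_Text", (0, 158, 612, 792), range(1, 15))
for region in regions[:2]:
    print(region)
```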

 

hellyars
13 - Pulsar

I updated the list of pages to match the number of pages in my test document, but the macro's drop-down is blank.

PhilipMannering
16 - Nebula

Ah, try this...

hellyars
13 - Pulsar

 

I must be doing something wrong. 

 

Image template is pointed to my PDF (14 pages). 

I used the annotate function to create a field called BODY (representing the lower 4/5 of the page).

I extended List of Pages to 14.

But I still can't enter anything into the List of Pages question, which results in the "No valid fields were selected" error.

 

hellyars_0-1681739853960.png

 

 

 

PhilipMannering
16 - Nebula

@hellyars Sorry this is such a faff. Can you try changing the data type of the List of Pages field to an integer? I think the macro doesn't accept Byte fields as options in the drop-down (which I should change).
