PDF to Text -- How to ignore the first part of page and apply template to all pages

hellyars
13 - Pulsar

I have PDF reports that are 2-N pages in length.

I use the PDF to Text tool to import the text into Alteryx.

Unfortunately, the source PDF is typically a poor quality scan, but I can work with it.

 

The PDF to Text tool is configured to Read Text Content Only and output as Lines. I found that applying Text Recognition in Adobe produces better results than the Read Text and Image Content setting in the Alteryx tool.

 

Here is my problem.

I want to ignore the top 1/5 of each page.

I tried using an Image Template Tool connected to the T input anchor.

I created a template using only the first page of the document and, in the Image Template Tool, highlighted the lower 4/5 of the page and assigned it a value of Body_Text. But that did not work: with the template attached, the PDF to Text tool processes only the first page of the document (not all 14).

 

A few other details...

While the structure of each report remains the same, the content varies considerably (hence I can only really call the lower 4/5 of each page 'body_text').

 

 

How

 

 

 

5 REPLIES
PhilipMannering
16 - Nebula

Hi @hellyars 

 

You drew a box on the first page of the template, so it returned text for the first page only. I agree that there should be an option to return text for every page. I've attempted to create a macro that creates markup to apply to every page listed in the "List of Pages" Text Input Tool. Please see attached and let me know if it works!

PhilipMannering_0-1681119006214.png
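
For readers who want to see the idea outside of Alteryx, here is a minimal Python sketch of "apply the same region to every listed page". It is not the attached macro and does not reproduce the Image Template Tool's actual markup format, which is not shown in this thread. The `Region` class, the function names, and the fractional coordinate convention (origin at the top-left of the page, values from 0 to 1) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A labelled rectangle on one page (hypothetical structure).

    Coordinates are fractions of the page size, origin at the top-left.
    """
    page: int
    label: str
    left: float
    top: float
    right: float
    bottom: float


def body_region(page: int, skip_top: float = 0.2, label: str = "Body_Text") -> Region:
    """Return a region covering everything below the top `skip_top` of the page."""
    if not 0.0 <= skip_top < 1.0:
        raise ValueError("skip_top must be in [0, 1)")
    return Region(page=page, label=label, left=0.0, top=skip_top, right=1.0, bottom=1.0)


def regions_for_pages(pages, skip_top: float = 0.2, label: str = "Body_Text"):
    """Repeat the same body region for every page number in `pages`."""
    return [body_region(p, skip_top, label) for p in pages]


# A 14-page document: one region per page, each skipping the top 1/5.
regions = regions_for_pages(range(1, 15))
```

Because the region is stored as fractions rather than pixels, the same box can be reused on every page as long as the pages share the same layout, which matches the "structure stays the same, content varies" situation described in the question.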

 

hellyars
13 - Pulsar

I updated the list of pages to match the number of pages in my test document, but the macro drop-down is blank.

PhilipMannering
16 - Nebula

Ah, try this...

hellyars
13 - Pulsar

 

I must be doing something wrong. 

 

Image template is pointed to my PDF (14 pages). 

I used the annotate function to create a field called BODY (that represents the lower 4/5 of the page).

I extended List of Pages to 14.

But I still can't enter anything into the List of Pages question, which results in the "No valid fields were selected" error.

 

hellyars_0-1681739853960.png

 

 

 

PhilipMannering
16 - Nebula

@hellyars Sorry this is such a faff. Can you try changing the data type of the List of Pages field to an integer? I think the macro doesn't accept Byte fields as candidates in the drop-down (which I should change).
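
As a rough illustration of how a field picker could end up empty, here is a small Python sketch. It only models the guess made in the reply above: the allow-list `ALLOWED_TYPES` and the function `selectable_fields` are hypothetical and do not describe how the macro's drop-down is really configured.

```python
# Hypothetical allow-list: the picker only offers fields with these types.
ALLOWED_TYPES = {"Int16", "Int32", "Int64"}


def selectable_fields(fields: dict) -> list:
    """Return the names of fields whose type the picker would accept.

    `fields` maps a field name to its data type name.
    """
    return [name for name, ftype in fields.items() if ftype in ALLOWED_TYPES]


# A page list typed as Byte is filtered out, so the drop-down is blank.
before = selectable_fields({"Page": "Byte"})

# After changing the type to Int32, the field becomes selectable.
after = selectable_fields({"Page": "Int32"})
```

If the macro does filter this way, small whole numbers such as page 1 to 14 would be a likely trigger, since a Text Input Tool can assign them the Byte type automatically.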
