Image Template - Detecting multiple tables across multiple pages

SANTSARK
5 - Atom

We have multiple PDFs in a similar format, but the number of pages and the number of lines in each section or table vary. Some tables extend across multiple pages.

I am using the Image Template tool to annotate the PDF, but when I feed its output into the Image to Text tool, I am not able to capture the table information correctly. The tables are either truncated or start from the wrong location.

Is there a way to indicate the start and end of a table so that the data in the tables is captured correctly?


Thanks

Santanu 

2 REPLIES
BS_THE_ANALYST
14 - Magnetar

@SANTSARK I'd advise using the PDF to Text tool. It will capture all the information across all of the PDFs. You'll then need a tool like the Multi-Row Formula to flag the start and end points of the tables. Once you've built the logic for one PDF, the same logic will apply to all the PDFs that give you the same information (regardless of whether a table spills over multiple pages).

BS_THE_ANALYST_0-1680212531680.png


Here's a fantastic link to a solution that will give you the logic you need to build the multi-row formula (if you don't already know):
https://community.alteryx.com/t5/Weekly-Challenge/Challenge-360-Goodbye-Michael-Part-2/td-p/1088787 

Inside this weekly challenge there was a great solution given by @PhilipMannering :

BS_THE_ANALYST_1-1680212651091.png


This logic will help you identify any start and end points, and be able to capture all the information between them.

As for the PDF to Text tool, configure it like so:

BS_THE_ANALYST_2-1680212696067.png


Lastly, the multi-row logic will look something like this. Flag the start and end rows, which then lets you capture everything in between.

BS_THE_ANALYST_0-1680214558649.png
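
For anyone who finds it easier to read the start/end idea as code, here is a rough Python sketch of the same pattern the Multi-Row Formula implements: carry an "inside the table" flag from one row to the next. The marker strings (`Item Description`, `Total`) and the function names are made up for illustration; swap in whatever text reliably opens and closes the tables in your PDFs. This is only a sketch under those assumptions, not the exact formula from the screenshot.

```python
from typing import Iterable, List, Tuple

# Hypothetical markers: replace with text that reliably opens/closes your tables.
START_MARKER = "Item Description"
END_MARKER = "Total"


def flag_table_rows(
    lines: Iterable[str],
    start: str = START_MARKER,
    end: str = END_MARKER,
) -> List[Tuple[str, bool]]:
    """Pair every line with a flag saying whether it is a table data row.

    The flag is carried forward from the previous line, which is the same
    idea as referencing the previous row in a Multi-Row Formula.
    """
    flagged = []
    in_table = False
    for line in lines:
        if start in line:
            # Header row: opens the table but is not data itself.
            # A header repeated on a later page lands here too and is skipped.
            in_table = True
            flagged.append((line, False))
        elif in_table and end in line:
            # Closing row: ends the table and is not data itself.
            in_table = False
            flagged.append((line, False))
        else:
            flagged.append((line, in_table))
    return flagged


def table_rows(lines: Iterable[str]) -> List[str]:
    """Keep only the rows that sit between a start and an end marker."""
    return [line for line, keep in flag_table_rows(lines) if keep]


sample = [
    "Invoice 123",
    "Item Description Qty",
    "Widget 2",
    "Gadget 5",
    "Item Description Qty",  # header repeated after a page break
    "Sprocket 1",
    "Total 8",
    "Thank you",
]
rows = table_rows(sample)
```

Because the flag stays on until an end marker appears, a table that runs over a page break is still captured as one block. Page footers or page numbers that fall in the middle would be kept as well, so you would need an extra filter for those.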

 




All the best,

BS

TheOC
15 - Aurora

Hey @SANTSARK,
@BS_THE_ANALYST's solution perfectly covers the main approach I would suggest for this problem: using the PDF to Text tool.

Just in case you aren't on 2022.3, you can also use the Image to Text tool in exactly the same way by not supplying a template. This will extract all text, and you can then build logic within your workflow to extract the specific information you need. Unfortunately, given the different shapes and sizes PDF documents can come in, it is hard to provide an example, but if you can share one of your PDF documents, I'd be happy to help you get started!
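
As a very rough illustration of the kind of downstream logic I mean, here is a small Python sketch that splits an extracted text line into columns. It assumes the extracted text separates columns with runs of two or more spaces, which may not hold for your PDFs; the function name and sample rows are invented for the example.

```python
import re
from typing import List


def split_columns(line: str) -> List[str]:
    """Split one extracted text line into cells.

    Assumption: columns are separated by two or more whitespace characters,
    while single spaces belong to the cell text itself.
    """
    cells = re.split(r"\s{2,}", line.strip())
    # An empty line would otherwise produce a single empty cell.
    return [cell for cell in cells if cell]


# Invented sample rows, standing in for lines captured from a table.
extracted = [
    "Widget A    2    10.00",
    "Gadget, large   5   3.50",
]
parsed = [split_columns(row) for row in extracted]
```

If your extraction collapses the spacing between columns, a split like this will not work and you would need a pattern specific to your layout, for example one that anchors on the numeric columns.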

