Alteryx Designer Desktop Discussions

SANTSARK · ‎03-30-2023

We have multiple PDFs (similar format) but the number of pages and number of lines in a section / table may vary. Some tables may extend into multiple pages.

I am using the Image Template tool to annotate the PDF, but when I feed its output into the Image to Text tool, I am not able to capture the table information correctly. The tables are either truncated or start from the wrong location.

Is there a way to indicate the start and end of the table so that the data in the tables are captured correctly?

Thanks

Santanu

BS_THE_ANALYST · ‎03-30-2023

@SANTSARK I'd advise using the PDF to Text tool. It will capture all the information across all of the PDFs. You'll just need to use a tool like the Multi-Row Formula to help you capture the start and end points of the tables. Once you've captured it for one PDF, the same logic will apply across all the PDFs that give you the same information (regardless of whether it spills over multiple pages).

Here's a fantastic link to a solution that will give you the logic you need to build the multi-row formula (if you don't already know):
https://community.alteryx.com/t5/Weekly-Challenge/Challenge-360-Goodbye-Michael-Part-2/td-p/1088787

Inside this weekly challenge there was a great solution given by @PhilipMannering :

This logic will help you identify any start and end points, and be able to capture all the information between them.

Regarding the PDF to Text tool. Configure it like so:

Lastly, the multi-row logic will look something like this. Capture start and end, this then allows you to capture the middle.
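
For anyone reading without the screenshot, here is a rough sketch of the same start/end idea in Python rather than in the Multi-Row Formula tool. The marker strings ("Item Qty Price" and "Total") and the sample lines are made up for illustration; you'd swap in whatever text reliably opens and closes the tables in your PDFs. It assumes the extracted text arrives one line per record.

```python
def extract_table_rows(lines, start_marker, end_marker):
    """Return the lines strictly between each start/end marker pair."""
    in_table = False  # running flag, carried from one row to the next
    rows = []
    for line in lines:
        if not in_table and start_marker in line:
            in_table = True   # start found; data begins on the next line
            continue
        if in_table and end_marker in line:
            in_table = False  # end found; stop capturing
            continue
        if in_table:
            rows.append(line)
    return rows


# Hypothetical extracted text; the table body could span a page break.
sample = [
    "Invoice 123",
    "Item Qty Price",
    "Widget 2 4.00",
    "Gadget 1 9.50",
    "Total 17.50",
    "Thank you",
]
rows = extract_table_rows(sample, "Item Qty Price", "Total")
print(rows)
```

Because the flag only changes when a marker is seen, it doesn't matter how many lines or pages sit between the start and the end. If the table header is repeated at the top of each page, those repeated lines would need to be filtered out separately.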

All the best,

BS


TheOC · ‎04-05-2023

hey @SANTSARK
@BS_THE_ANALYST covered the main way I would suggest tackling this problem perfectly in his solution, using the PDF to Text tool.

Just in case you aren't on 2022.3, you can also use the Image to Text tool in exactly the same way by not supplying a template. This will extract all the text, and you will be able to create logic within your workflow to extract the specific information you need. Unfortunately, given the different shapes and sizes PDF documents can come in, it can be hard to provide you with an example, but if you are able to provide me with one of your PDF documents, I'd be happy to help you get started!

Cheers,
TheOC