PDF to Text Tool Error: "Tool didn't extract text from markup region X on page 1 of file Y"

Question

Trying to convert a 229-page table from PDF to text using the Alteryx Intelligence Suite tools.

Each page has between 13 and 20 rows.

Pages with 19, 18, and 20 rows account for 99% of the total.

I have Image Templates for each row count scenario.

There are 8 fields (columns) on each page.

The 3rd field is NAME. And while each row in the PDF table represents one record, NAME might span two lines within that row (among many other challenges with auto outputs).

>>> This rules out Output Options without a Template. The output splits PDF single records with two lines in NAME across 1-3 lines.

>>> Using a Template and annotating each field as a column (down) also does not work. The resulting output still splits a single PDF record row across 2-3 output rows if it contains more than two lines for NAME.

>>> Only using a Template and annotating each row (across) for each row in the PDF maintains the link between output records split across multiple rows.  In this configuration, the Region and Row fields act to identify the relationship to one another.

BUT I still get the warning "Tool didn't extract text from markup region X on page 1 of file Y."

The latest run produced 65+ warnings -- i.e., 65 or more rows that were not extracted.

And it's always a handful of rows (regions) that are the culprit, such as 8,9,11,13 - but not always together.
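One way to see which regions are the repeat offenders is to tally the warnings by region number. The sketch below is only an illustration: it assumes the warnings have been copied out of the Results window into a list of strings, and that they follow the wording quoted above. The function name and sample file names are made up.

```python
import re
from collections import Counter

# Pattern modeled on the warning text quoted in the post; the exact wording
# in a real log is an assumption.
WARNING_RE = re.compile(
    r"didn't extract text from markup region (\d+) on page (\d+) of file (.+?)\.?$"
)

def tally_missed_regions(log_lines):
    """Count missed extractions per region number."""
    counts = Counter()
    for line in log_lines:
        match = WARNING_RE.search(line.strip())
        if match:
            counts[int(match.group(1))] += 1
    return counts

# Made-up sample messages for illustration only.
sample_log = [
    "Tool didn't extract text from markup region 8 on page 1 of file a.pdf.",
    "Tool didn't extract text from markup region 9 on page 1 of file a.pdf.",
    "Tool didn't extract text from markup region 8 on page 1 of file b.pdf.",
    "Some unrelated message",
]
missed = tally_missed_regions(sample_log)
for region, count in missed.most_common():
    print(f"region {region}: {count} missed")
```

If the same few regions dominate the tally across files, that points at the template geometry for those rows rather than at individual pages.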

I have now hand-drawn these region boxes 8 or more times and still can't get rid of these missed extractions (always regions/rows in the middle of the PDF).

What causes these missed extractions? I thought maybe a region was too narrow or slightly overlapped another, so I zoomed in to a pretty ridiculous level to redraw the boxes, and that still did not help.

Any thoughts on ways to mitigate?

On a positive note, the PDF tools have improved greatly over the last year and are now wicked fast.  230 pages takes less than 30 seconds.

hellyars · Accepted Answer

The Good:

Using horizontal annotations in the Image Template tool across each record (line) in the source material maintains the relationship between multiple output rows for a given record through the Region field. Apart from the missing record extracts, I had a way to piece split records back together.
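The piecing-together step can be sketched outside Alteryx as a simple group-and-join on the Region key. This is a minimal sketch, not the workflow actually used: the dictionary keys (`file`, `page`, `region`, `row`, `text`) are hypothetical stand-ins for whatever field names the tool output carries, and the sample rows are invented.

```python
from collections import defaultdict

def merge_split_records(rows):
    """Join text fragments that share a (file, page, region) key, in row order."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["file"], row["page"], row["region"])
        grouped[key].append((row["row"], row["text"]))
    merged = {}
    for key, fragments in grouped.items():
        fragments.sort(key=lambda fragment: fragment[0])  # keep original line order
        merged[key] = " ".join(text for _, text in fragments)
    return merged

# Invented sample: region 3 was split across two output rows.
sample_rows = [
    {"file": "a.pdf", "page": 1, "region": 3, "row": 2, "text": "HOLDINGS LLC"},
    {"file": "a.pdf", "page": 1, "region": 3, "row": 1, "text": "EXAMPLE"},
    {"file": "a.pdf", "page": 1, "region": 4, "row": 1, "text": "SINGLE LINE"},
]
records = merge_split_records(sample_rows)
```

The same idea maps to a Summarize tool grouped on file, page, and Region with a concatenate action, provided the rows are sorted first.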

The Bad & Ugly:

I kept running into the missing extract warning (i.e., missing records in the output).

The potential workaround was to run separate iterations of the workflow, each targeting one individual problem row.

The large source material already required 4 separate image template/output iterations to account for the different # of rows found across the document.

The workaround would now require each template to be split into 3 or more sub-templates: one for the rows above the problem row(s), one for each individual problem row, and one for the rows below the problem row(s).

The workaround could end up needing 3-7 templates for each of the 4 scenarios, i.e., 12 to 28 templates or more in total.
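To make the template count concrete, here is a rough sketch of how a page's rows would be carved into sub-templates around the problem rows. The function is hypothetical, and it assumes that good rows sitting between two problem rows also need their own sub-template, which the description above implies but does not state.

```python
def plan_sub_templates(n_rows, problem_rows):
    """Split rows 1..n_rows into groups: one per problem row, one per run of good rows."""
    problems = set(problem_rows)
    plan = []
    run = []
    for row in range(1, n_rows + 1):
        if row in problems:
            if run:
                plan.append(run)   # close the run of good rows above this problem row
                run = []
            plan.append([row])     # each problem row gets its own sub-template
        else:
            run.append(row)
    if run:
        plan.append(run)           # trailing good rows below the last problem row
    return plan

# One problem row on a 19-row page gives the minimum of 3 sub-templates.
single = plan_sub_templates(19, [8])
# The row numbers mentioned in the question, if they all failed on the same page.
several = plan_sub_templates(19, [8, 9, 11, 13])
print(len(single), len(several))
```

With several scattered problem rows the count per scenario grows quickly, which is what makes the workaround unattractive.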

The Solution

AWS Textract killed two birds with one stone:

1) it accurately extracted all table rows in one go, and

2) there were no split records (i.e., it correctly interpreted the NAME field when it included two lines in each record).

The lack of split rows simplified the post-import editing in Alteryx.
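For anyone wanting to try the same route, the sketch below shows one way to turn Textract table output into plain rows. It is an assumption-laden illustration, not the steps used here: it works on a list of blocks shaped like the `Blocks` list that Textract returns when table analysis is requested (for example via boto3's `analyze_document` with `FeatureTypes=["TABLES"]`, or the asynchronous `start_document_analysis` / `get_document_analysis` pair for multi-page PDFs in S3), and it assumes a single table per block list. The sample blocks are hand-written and the function name is made up.

```python
def table_rows_from_blocks(blocks):
    """Rebuild table rows from a Textract-style list of blocks (single table assumed)."""
    by_id = {block["Id"]: block for block in blocks}
    cells = {}
    for block in blocks:
        if block["BlockType"] != "CELL":
            continue
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for child_id in rel["Ids"]:
                child = by_id.get(child_id)
                if child and child["BlockType"] == "WORD":
                    words.append(child["Text"])
        # All words in a cell are joined, so a two-line NAME stays in one field.
        cells[(block["RowIndex"], block["ColumnIndex"])] = " ".join(words)
    if not cells:
        return []
    n_rows = max(r for r, _ in cells)
    n_cols = max(c for _, c in cells)
    return [
        [cells.get((r, c), "") for c in range(1, n_cols + 1)]
        for r in range(1, n_rows + 1)
    ]

# Hand-written sample blocks for illustration only.
sample_blocks = [
    {"Id": "w1", "BlockType": "WORD", "Text": "EXAMPLE"},
    {"Id": "w2", "BlockType": "WORD", "Text": "HOLDINGS"},
    {"Id": "w3", "BlockType": "WORD", "Text": "42"},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
]
table = table_rows_from_blocks(sample_blocks)
```

The resulting list of rows can be written to CSV and brought into Alteryx with a standard Input Data tool.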

I only wish I had switched earlier. I spent hours trying to force the pure Alteryx solution.

The Reality

This time AWS Textract was the better solution.

BUT, PDF variability means what works for one PDF might not work for another.

It depends on the structure and quality of the document (and some luck).

And with public documents (as in the case here and below), variability (e.g., image scans) across and even within a document is a challenge.

Last week the Alteryx PDF to Text was the better tool when working with an equally large PDF (image scan).

In that scenario, the Alteryx tool quickly and more accurately extracted data than AWS Textract -- requiring less time to process/edit in a scenario where the client was counting the minutes between document release and data delivery.

Being able to switch between tools, and knowing when to, is the solution.