EmilyVA
Alteryx

PDFs hold tons of valuable information that we’d like to set free using the power of Alteryx! And they are so ubiquitous that they feel familiar and easy. But when the Alteryx Intelligence Suite team sat down to design our new PDF to Text tool, we realized there was a lot more to the Portable Document Format than meets the eye. That complexity shaped the choices we made as we designed the new tool. We hope pulling back the curtain on that process will be interesting and helpful as you start using the tool!

 


 

What is a PDF, anyway?

 

Fundamentally, a PDF is a file created following the rules in the Portable Document Format. The PDF specification was first introduced by Adobe in 1993 and was released as an open standard managed by the International Organization for Standardization (ISO) in 2008. The current version of the ISO standard for PDFs is almost 1000 pages long, and between the original introduction and the current standard, there have been several intermediate specifications. These standards have, in turn, been implemented by many different PDF writing programs that made different choices in how to apply the specifications. The result of this evolution over time and the flexibility of the 1000-page standard:

 

Two identical-looking PDFs can have very different internal structures and content.

 


 

If you’ve ever tried to open up a PDF with a text editor to look for the text and other elements that you see with a PDF viewer, you may have experienced something like this:

 

Source: GIPHY

 

That being said, any given PDF file may contain some of the following elements:

  • Bitmap graphics (photographs, scans, other images specified pixel-by-pixel)
  • Vector graphics (instructions for creating drawings using shapes and lines)
  • Text stored as content streams (instructions on where and how to draw text on the page)
  • Multimedia objects, links, and other embedded content
  • Fonts packaged with the file so they can travel with the document
  • Instructions for how and where to draw or embed each element on each page
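To make the "text stored as content streams" item concrete, here is a minimal Python sketch. It assumes a hand-written, one-line content stream (the text, font name, and coordinates are made up for illustration) and uses zlib to stand in for the Flate compression that PDF writers commonly apply to streams. It is not how PDF to Text works internally; it only hints at why a PDF opened in a text editor often looks like gibberish.

```python
import zlib

# A tiny, hand-written page content stream (illustrative, not from a real file).
# BT/ET begin and end a text object, Tf selects a font and size,
# Td moves the text position, and Tj draws the string in parentheses.
content_stream = b"BT /F1 12 Tf 72 720 Td (Hello, PDF!) Tj ET"

# Many PDF writers compress streams with Flate (zlib/deflate) before storing them.
stored_bytes = zlib.compress(content_stream)

# Roughly what a text editor would show: compressed bytes, not readable operators.
print(stored_bytes)

# What a PDF reader does first: decompress, then interpret the operators.
recovered = zlib.decompress(stored_bytes)
print(recovered.decode("ascii"))
```

Even after decompression, the stream is a list of drawing instructions rather than plain sentences, which is part of why two identical-looking pages can be stored so differently.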

 

image-20220817-175140.png

 

When it comes specifically to text, there is a spectrum of approaches to creating PDFs that made it more complicated for us to design a good PDF text extraction tool:

 

| Common PDF Creation Techniques | Implications for Text Storage and Extraction |
| --- | --- |
| Taking a picture or scanning a document | Text is stored as bitmap graphics and requires Optical Character Recognition (OCR) to extract text |
| Using OCR to overlay transparent text on top of a scanned or photo-based document | Text appears twice in the document - once as bitmap graphics in the image, and again as an invisible text content overlay to support copy-pasting and searching |
| Optimizing PDF size by converting characters in a non-typical font into vector graphics (drawings of the letters) instead of embedding the whole font in the document | Text is stored as vector graphics and requires OCR to extract text |
| Combining pictures of text, drawings of text, and text content on a single page | Text is stored as bitmap graphics, vector graphics, and text content, so extracting all the words requires both reading the text content and applying OCR to the text stored as bitmap and vector graphics |
| Writing a digital “True PDF” document with all text stored as text content | Huzzah! Text content extraction will retrieve all the text in this document! (Unless there are words embedded in images like logos or diagrams or pictures.) |

 


 

Bringing PDFs into Alteryx: The Original Tools

 

In 2020, Alteryx Intelligence Suite was launched with tools designed to extract data from PDFs. In our original approach, we first convert all PDFs to images using Image Input. Then we apply OCR to the image of each page using Image to Text. This is great because it always works, regardless of variability in how the PDF was created!

 

image-20220817-201931.png

 

However, even an excellent OCR model applied to the most pristine images of text only has ~97% accuracy. Which is also great! But if a page of text has hundreds of characters, small inaccuracies may accumulate. (Also, the OCR models can be a bit slow.) Since at least some PDFs have text content that might be read directly (and quickly! with near 100% accuracy, in most cases!), we started to wonder if there might be a way to bring that text content into Alteryx.
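To put a rough number on how those small inaccuracies accumulate, here is a back-of-the-envelope Python sketch. It assumes the ~97% figure is a per-character accuracy and that errors are independent from one character to the next; both are simplifying assumptions for illustration, and the 500-character page is hypothetical.

```python
def expected_ocr_errors(num_characters: int, char_accuracy: float) -> float:
    """Expected number of misread characters on a page."""
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    return num_characters * (1.0 - char_accuracy)


def chance_of_error_free_page(num_characters: int, char_accuracy: float) -> float:
    """Probability that every character is read correctly (independent errors assumed)."""
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    return char_accuracy ** num_characters


page_characters = 500  # hypothetical page length
print(round(expected_ocr_errors(page_characters, 0.97), 1))
print(chance_of_error_free_page(page_characters, 0.97))
```

Under these assumptions, a 500-character page would carry about 15 misread characters on average, and the chance of a completely clean page is tiny.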

 


 

Bringing PDFs into Alteryx: The Next Generation

 

Enter: PDF to Text! Our initial goal with PDF to Text was just to extract the text content from PDF documents. Then we met the invoice below:

 

image-20220817-194408.png

 

This is a real invoice that Alteryx was sent by one of our vendors (although all the names and numbers have been anonymized for everyone’s privacy). Text content alone will get us about half the text on this page; the rest is stored as graphic content. And depending on the use case, the text content might contain everything we need, or... it might not.

 


 

So we realized we needed to do a few things:

 

  • Give users the ability to combine text content with OCR results from the graphic content of each page. We called this “magic” internally during the development process, as it took some creative thinking to make the solution work. This is the Read Text and Image Content Text Extraction Option in PDF to Text. It gives the most complete and accurate result for text on the page but takes a bit longer (~1-2 seconds per page, depending on the document and your computer hardware).

 


 

  • Give users the ability to Read Text Content Only for the times when all the content they care about is available as text content, and they don’t want to take the time to run OCR on each page. This can be much faster (~0.2 - 1 second per page, again depending on the document and your computer hardware)! But also… a little scary! Because it’s hard to tell what you might be missing in graphic text!

 


 

  • Give users guard rails that will let them experiment with Read Text Content Only while assessing whether they might be losing critical content present as graphic text. Specifically:
    • Output Image of Page Graphics results in an image BLOB (binary large object) in the Image output column with the Output Option column value “pdf graphics”. This image can be rendered by connecting an Image tool with the Get Image from Binary Data in Field option and visually inspected with a Browse tool attached to the Image tool. It shows only what is “left behind” by the text content extraction.

image-20220826-223905.png

 

    • Risk Score for Text Encoded as Graphics goes one step further and applies OCR to only the graphic elements of each page. It counts the number of graphic text words and outputs that in the Graphic Text Word Count column. It also assigns a Graphic Text Risk level to each page based on that word count.
      • 9 or fewer graphic text words (such as might be found in a logo): “low” risk
      • 10-29 words: “medium” risk
      • 30 or more words: “high” risk

 

We developed those thresholds by looking at a representative set of documents, but you can calibrate your own risk levels using the raw word counts and images of page graphics for your documents and assign those risk levels using a Formula tool. You can also use the Risk level or the Graphic Text Word Count to filter your pages downstream into different processing workflows.
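The threshold logic above can be sketched in a few lines of Python. This is only an illustration of the rule, not the tool's implementation: in a workflow you would express it in a Formula tool. The function name, parameter names, and sample word counts are hypothetical; the default cut-offs are the ones listed above.

```python
from typing import Dict, List


def graphic_text_risk(word_count: int, medium_at: int = 10, high_at: int = 30) -> str:
    """Map a graphic text word count to a risk level.

    Defaults mirror the thresholds listed above (9 or fewer: low,
    10-29: medium, 30 or more: high). Pass your own cut-offs to recalibrate.
    """
    if word_count < 0:
        raise ValueError("word_count cannot be negative")
    if medium_at > high_at:
        raise ValueError("medium_at must not exceed high_at")
    if word_count >= high_at:
        return "high"
    if word_count >= medium_at:
        return "medium"
    return "low"


# Hypothetical per-page counts, as might appear in the Graphic Text Word Count column.
page_word_counts = {1: 4, 2: 17, 3: 42}

# Group page numbers by risk level so each group can be routed differently.
pages_by_risk: Dict[str, List[int]] = {"low": [], "medium": [], "high": []}
for page_number, count in page_word_counts.items():
    pages_by_risk[graphic_text_risk(count)].append(page_number)

print(pages_by_risk)
```

Calling `graphic_text_risk(17, medium_at=5, high_at=15)` shows how stricter custom thresholds would move the same page into a higher risk level.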

 

Combining the Read Text Content Only option with the Risk Score for Text Encoded as Graphics option is not significantly faster than the Read Text and Image Content option, as both are reading in text content and applying OCR to each page. This combination does, however, give users the opportunity to explore what risks they would be taking if they implemented Read Text Content Only without the risk score in exchange for the speed improvements that come with dispensing with the OCR.
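To get a feel for the speed trade-off, here is a small Python sketch that scales the approximate per-page timings quoted earlier to a whole batch. The timings are the rough ranges from this post and will vary with the document and your hardware; the function name and the 1,000-page batch are hypothetical.

```python
from typing import Dict, Tuple

# Approximate seconds per page quoted in this post (low, high).
SECONDS_PER_PAGE: Dict[str, Tuple[float, float]] = {
    "Read Text Content Only": (0.2, 1.0),
    "Read Text and Image Content": (1.0, 2.0),
}


def estimate_runtime_seconds(option: str, num_pages: int) -> Tuple[float, float]:
    """Return a (low, high) runtime estimate in seconds for a batch of pages."""
    if num_pages < 0:
        raise ValueError("num_pages cannot be negative")
    low, high = SECONDS_PER_PAGE[option]
    return (low * num_pages, high * num_pages)


batch_pages = 1000  # hypothetical batch size
for option in SECONDS_PER_PAGE:
    low, high = estimate_runtime_seconds(option, batch_pages)
    print(f"{option}: roughly {low / 60:.0f} to {high / 60:.0f} minutes")
```

Because the two ranges overlap at around 1 second per page, the estimate is only a starting point; timing a sample of your own documents will tell you more.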

 


 

  • We also give users the ability to Preview what the Read Text Content Only vs. Read Text and Image Content options might extract. When a single file is selected with the “Browse” button in the PDF to Text configuration window, the Preview window below will show what content each text extraction option can access. For instance, in the example below we can see that for this file, most of the text would be extracted by Read Text Content Only (right), but text embedded in the images of the toolbars will be skipped (for better or for worse, depending on the way the data will be used downstream).

 

image-20220829-195021.png

 

  • A bonus of Read Text Content Only mode: more languages! The OCR used in Read Text and Image Content and Risk Score for Text Encoded as Graphics uses the languages specified in the Language selection to refine its results. However, the text content extraction is reading characters directly from the PDF, and as long as it can read those characters, it does not care what language they are from!

 


 

Conclusion

 

Thanks for joining us on this journey through the inner space of PDFs and the resulting options we’ve provided in PDF to Text! We’re looking forward to seeing what you can do with the tool!

To find additional resources on the AIS tools, click here:

  1. Alteryx Intelligence Suite Learning Path
  2. Alteryx Intelligence Suite Tools Help Main Page
Comments
mceleavey
17 - Castor

noice.gif

simonaubert_bd
13 - Pulsar

And that's why you should use EDI- instead of PDF in your business :D
https://en.wikipedia.org/wiki/Electronic_data_interchange

But seriously now, that's really cool. Thanks for the explanations.

Dynamomo
11 - Bolide

@EmilyVA  Great write-up!

Can you help with a question that came up when I recently demoed this?

"How does Alteryx handle PDFs that may have malicious code embedded in them? Would the PDF reader allow a file to execute malicious code?"

Thanks!



EmilyVA
Alteryx

Hi @Dynamomo! Glad you liked the write-up.  We rely on a battle-tested open source PDF library (poppler) which has been in development since 1995. Vulnerabilities in this library are regularly evaluated and patched; the risk of malicious code execution is therefore very low but cannot be guaranteed to be zero.

Ben29a
5 - Atom

I'm using PDF to Text, but I need the images as well. How do I get them?

The PDF to Text tool has a setting called "Output Image of Page Graphics":

  • Output Image of Page Graphics results in an image BLOB (binary large object) in the Image output column with the Output Option column value “pdf graphics”. This image can be rendered by connecting an Image tool with the Get Image from Binary Data in Field option and visually inspected with a Browse tool attached to the Image tool. It shows only what is “left behind” by the text content extraction

This blob contains all image-related elements of the PDF page (line and table layouts, images, logos, ...), and there is one blob for each page of the PDF file.

How can I separate each of these blob elements, then identify, tag, and name them individually so that they can be used as standalone binary data elements?

Think of it like this: the PDF is a product catalogue. Each page contains several products, each with a text description and a product image. After reading the whole PDF, the objective is to link each product image to its text elements. For the text, I have no problem; the challenge is extracting the correct image from the blob.

Anyone?

EmilyVA
Alteryx

@Ben29a our current approach for this within Intelligence Suite is to use Image Template to tag each of the elements you want. The template JSON that comes out of Image Template can be used as input on the T anchor for either Image to Text or PDF to Text to extract individual tagged images.

Ben29a
5 - Atom
 

Hi Emily, as shown in the chart below, the bottom process visualizes the blob images: 1 PDF page = 1 blob, with the blob containing all graphical elements of that page. The next step is to extract each of the images within the blob so that they can be named and processed individually. Some will be kept, others deleted. How do I connect the blob flow to the red tools for processing? Somehow I can't get the connection to work.

 

Alteryx blob 2023-07-03 164352.png

 

 

 

EmilyVA
Alteryx

Hi @Ben29a,

 

 The Image Template + Image Input tool option assumes you'd crop the images you want out of the original PDF page (before the text content is extracted), rather than using the pdf graphics images output from PDF to Text.

 

If you'd prefer to start with the "pdf graphics" page images from PDF to Text, another option would be to feed those images into the Image Processing tool and use the crop functionality to extract the image components you're looking for.

roughchr
6 - Meteoroid

Hi Emily

 

Is the PDF to Text tool available now, and if so, in which version/package of Alteryx? I don't see it, perhaps because I don't currently have Intelligence Suite installed? I noticed that the link to the tool no longer works and wanted to check that it is still supported: https://help.alteryx.com/20221/designer/pdf-text

 

Thanks

EmilyVA
Alteryx

Hi @roughchr - yes the PDF to Text tool is available!  It's included in Intelligence Suite starting with 2022.2.  The current documentation link is here:  PDF to Text (alteryx.com).