
EmilyVA · ‎11-28-2022

PDFs hold tons of valuable information that we’d like to set free using the power of Alteryx! And they are so ubiquitous that they feel familiar and easy. But when the Alteryx Intelligence Suite team sat down to design our new PDF to Text tool, we realized there was a lot more to the Portable Document Format than meets the eye. That complexity shaped the choices we made as we designed the new tool. We hope pulling back the curtain on that process will be interesting and helpful as you start using the tool!


What is a PDF, anyway?

Fundamentally, a PDF is a file that follows the rules of the Portable Document Format. Adobe first introduced the PDF specification in 1993, and in 2008 it was released as an open standard managed by the International Organization for Standardization (ISO). The current version of the ISO standard is almost 1000 pages long, and several intermediate specifications were published between the original introduction and the current standard. These standards have, in turn, been implemented by many different PDF-writing programs, each making its own choices about how to apply the specifications. The result of this evolution over time and the flexibility of the 1000-page standard:

Two identical-looking PDFs can have very different internal structures and content.


If you’ve ever opened a PDF in a text editor to look for the text and other elements you see in a PDF viewer, you may have found mostly unreadable data instead.

That being said, any given PDF file may contain some of the following elements:

Bitmap graphics (photographs, scans, other images specified pixel-by-pixel)
Vector graphics (instructions for creating drawings using shapes and lines)
Text stored as content streams (instructions on where and how to draw text on the page)
Multimedia objects, links, and other embedded content
Fonts packaged with the file so they can travel with the document
Instructions for how and where to draw or embed each element on each page
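To make “text stored as content streams” a little more concrete, here is a minimal sketch of what such a stream can look like and why it rarely shows up as readable words in a text editor. The stream below is hand-written for illustration (the font name `/F1` and the text are made up), and it assumes the writer compressed the stream with the common Flate (zlib) filter; real files vary widely.

```python
import zlib

# A tiny, hand-written page content stream (illustrative, not from a real file).
# BT/ET begin and end a text object, Tf selects a font and size,
# Td moves the text position, and Tj draws the string in parentheses.
raw_stream = b"BT /F1 12 Tf 72 720 Td (Invoice total: 42.00) Tj ET"

# Assumption: the PDF writer stored the stream with the Flate (zlib) filter,
# which is a common choice but not the only one.
stored_stream = zlib.compress(raw_stream)

# A PDF reader reverses the filter before interpreting the drawing instructions.
recovered = zlib.decompress(stored_stream)

print(stored_stream)              # the bytes a text editor would show
print(recovered.decode("ascii"))  # the drawing instructions a reader interprets
```

Even after decompression, the result is a set of drawing instructions rather than a paragraph of text, which is part of why text extraction takes more than a simple file read.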

When it comes specifically to text, there is a spectrum of approaches to creating PDFs that made it more complicated for us to design a good PDF text extraction tool:

Common PDF Creation Techniques	Implications for Text Storage and Extraction
Taking a picture or scanning a document	Text is stored as bitmap graphics and requires Optical Character Recognition (OCR) to extract text
Using OCR to overlay transparent text on top of a scanned or photo-based document	Text appears twice in the document - once as bitmap graphics in the image, and again as an invisible text content overlay to support copy-pasting and searching
Optimizing PDF size by converting characters in a non-typical font into vector graphics (drawings of the letters) instead of embedding the whole font in the document	Text is stored as vector graphics and requires OCR to extract text
Combining pictures of text, drawings of text, and text content on a single page	Text is stored as bitmap graphics, vector graphics, and text content, so extracting all the words requires both reading the text content and applying OCR to the text stored as bitmap and vector graphics
Writing a digital “True PDF” document with all text stored as text content	Huzzah! Text content extraction will retrieve all the text in this document! (Unless there are words embedded in images like logos or diagrams or pictures.)
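The table above boils down to a small decision rule, sketched here in Python. The `PageContent` class and `extraction_needed` function are hypothetical names used only for illustration; they are not part of any Alteryx tool, and the sketch assumes you already know which kinds of text a page contains.

```python
from dataclasses import dataclass


@dataclass
class PageContent:
    """Hypothetical summary of how text is stored on one PDF page."""
    has_text_content: bool  # text stored as text in content streams
    has_bitmap_text: bool   # text inside photos or scans
    has_vector_text: bool   # letters drawn as shapes and lines


def extraction_needed(page: PageContent) -> str:
    """Return which extraction approach would be needed to get all the text."""
    graphic_text = page.has_bitmap_text or page.has_vector_text
    if page.has_text_content and graphic_text:
        return "text content + OCR"
    if graphic_text:
        return "OCR"
    if page.has_text_content:
        return "text content"
    return "nothing to extract"


scanned_page = PageContent(False, True, False)
true_pdf_page = PageContent(True, False, False)
mixed_page = PageContent(True, True, True)

print(extraction_needed(scanned_page))   # OCR
print(extraction_needed(true_pdf_page))  # text content
print(extraction_needed(mixed_page))     # text content + OCR
```

Note that a scanned page with an invisible OCR overlay also has both text content and bitmap text, so a real tool has to avoid counting that text twice. This sketch does not attempt that.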


Bringing PDFs into Alteryx: The Original Tools

In 2020, Alteryx Intelligence Suite was launched with tools designed to extract data from PDFs. In our original approach, we first convert all PDFs to images using Image Input. Then we apply OCR to the image of each page using Image to Text. This is great because it always works, regardless of variability in how the PDF was created!

However, even an excellent OCR model applied to the most pristine images of text has only ~97% accuracy. Which is also great! But a page of text can hold hundreds of characters, so small inaccuracies add up. (Also, the OCR models can be a bit slow.) Since at least some PDFs have text content that can be read directly (and quickly! with near 100% accuracy, in most cases!), we started to wonder if there might be a way to bring that text content into Alteryx.
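To see how quickly small inaccuracies add up, here is a back-of-the-envelope sketch. It assumes the ~97% figure is a per-character accuracy and that errors are independent of one another; both are simplifications for illustration, and the function names are hypothetical.

```python
def expected_ocr_errors(num_characters: int, char_accuracy: float = 0.97) -> float:
    """Expected number of misread characters on a page (assumes per-character accuracy)."""
    return num_characters * (1.0 - char_accuracy)


def prob_page_error_free(num_characters: int, char_accuracy: float = 0.97) -> float:
    """Chance that every character on the page is read correctly (assumes independent errors)."""
    return char_accuracy ** num_characters


page_length = 500  # illustrative page with 500 characters
print(round(expected_ocr_errors(page_length), 1))   # roughly 15 misread characters
print(prob_page_error_free(page_length) < 0.000001)  # an error-free page is very unlikely
```

Under these assumptions, a 500-character page would average about 15 misread characters, which is why reading text content directly is attractive whenever it is available.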


Bringing PDFs into Alteryx: The Next Generation

Enter: PDF to Text! Our initial goal with PDF to Text was just to extract the text content from PDF documents. Then we met the invoice below:

This is a real invoice that one of our vendors sent to Alteryx (although all the names and numbers have been anonymized for everyone’s privacy). Text content alone gets us about half the text on this page; the rest is stored as graphic content. And depending on the use case, the text content might contain everything we need, or… it might not.


So we realized we needed to do a few things:

Give users the ability to combine text content with OCR results from the graphic content of each page. We called this “magic” internally during the development process, as it took some creative thinking to make the solution work. This is the Read Text and Image Content Text Extraction Option in PDF to Text. It gives the most complete and accurate result for text on the page but takes a bit longer (~1-2 seconds per page, depending on the document and your computer hardware).


Give users the ability to Read Text Content Only for the times when all the content they care about is available as text content, and they don’t want to take the time to run OCR on each page. This can be much faster (~0.2 - 1 second per page, again depending on the document and your computer hardware)! But also… a little scary! Because it’s hard to tell what you might be missing in graphic text!


Give users guard rails that will let them experiment with Read Text Content Only while assessing whether they might be losing critical content present as graphic text. Specifically:
- Output Image of Page Graphics results in an image BLOB (binary large object) in the Image output column with the Output Option column value “pdf graphics”. This image can be rendered by connecting an Image tool with the Get Image from Binary Data in Field option and visually inspected with a Browse tool attached to the Image tool. It shows only what is “left behind” by the text content extraction.

- Risk Score for Text Encoded as Graphics goes one step further and applies OCR to only the graphic elements of each page. It counts the number of graphic text words and outputs that in the Graphic Text Word Count column. It also assigns a Graphic Text Risk level to each page based on that word count.
  - 9 or fewer graphic text words (such as might be found in a logo): “low” risk
  - 10-29 words: “medium” risk
  - 30 or more words: “high” risk

We developed those thresholds by looking at a representative set of documents, but you can calibrate your own risk levels using the raw word counts and images of page graphics for your documents and assign those risk levels using a Formula tool. You can also use the Risk level or the Graphic Text Word Count to filter your pages downstream into different processing workflows.
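The default thresholds above, and the idea of calibrating your own, can be sketched in Python as follows. This is an illustration of the logic only, not the tool’s internal code: the function names, the dictionary keys, and the choice to send only “low” risk pages down a text-only path are all assumptions. In a workflow you would express the same rule in a Formula tool.

```python
def graphic_text_risk(word_count: int, medium_from: int = 10, high_from: int = 30) -> str:
    """Map a graphic text word count to a risk level.

    Defaults mirror the thresholds described above; pass your own
    medium_from / high_from values to calibrate for your documents.
    """
    if word_count < 0:
        raise ValueError("word_count cannot be negative")
    if word_count >= high_from:
        return "high"
    if word_count >= medium_from:
        return "medium"
    return "low"


def route_pages(pages):
    """Split pages into two hypothetical downstream paths based on risk."""
    routes = {"text_only": [], "needs_ocr": []}
    for page in pages:
        risk = graphic_text_risk(page["graphic_text_word_count"])
        key = "text_only" if risk == "low" else "needs_ocr"
        routes[key].append(page["page"])
    return routes


# Made-up word counts for three pages.
pages = [
    {"page": 1, "graphic_text_word_count": 3},
    {"page": 2, "graphic_text_word_count": 42},
    {"page": 3, "graphic_text_word_count": 12},
]
print(route_pages(pages))  # {'text_only': [1], 'needs_ocr': [2, 3]}
```

A stricter calibration, such as `graphic_text_risk(count, medium_from=5, high_from=15)`, would flag pages with less graphic text.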

Combining the Read Text Content Only option with the Risk Score for Text Encoded as Graphics option is not significantly faster than the Read Text and Image Content option, as both are reading in text content and applying OCR to each page. This combination does, however, give users the opportunity to explore what risks they would be taking if they implemented Read Text Content Only without the risk score in exchange for the speed improvements that come with dispensing with the OCR.
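To get a feel for the speed trade-off, the rough per-page timings quoted above can be scaled up to a batch of documents. This is simple arithmetic on those approximate ranges, not a benchmark; the names below are hypothetical, and actual timings depend on your documents and hardware.

```python
# Approximate per-page timings in seconds, taken from the ranges quoted above.
TEXT_CONTENT_ONLY = (0.2, 1.0)
TEXT_AND_IMAGE_CONTENT = (1.0, 2.0)


def estimate_seconds(num_pages: int, per_page_range: tuple) -> tuple:
    """Return a (low, high) estimate of total processing time in seconds."""
    low, high = per_page_range
    return (num_pages * low, num_pages * high)


num_pages = 1000  # illustrative batch size
fast_low, fast_high = estimate_seconds(num_pages, TEXT_CONTENT_ONLY)
full_low, full_high = estimate_seconds(num_pages, TEXT_AND_IMAGE_CONTENT)

print(f"Read Text Content Only: {fast_low / 60:.0f} to {fast_high / 60:.0f} minutes")
print(f"Read Text and Image Content: {full_low / 60:.0f} to {full_high / 60:.0f} minutes")
```

For a 1000-page batch, that works out to very roughly 3 to 17 minutes for text content only versus 17 to 33 minutes with OCR, under these assumed timings.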


We also give users the ability to Preview what the Read Text Content Only vs. Read Text and Image Content options might extract. When a single file is selected with the “Browse” button in the PDF to Text configuration window, the Preview window below will show what content each text extraction option can access. For instance, in the example below we can see that for this file, most of the text would be extracted by Read Text Content Only (right), but text embedded in the images of the toolbars will be skipped (for better or for worse, depending on the way the data will be used downstream).

A bonus of Read Text Content Only mode: more languages! The OCR used in Read Text and Image Content and Risk Score for Text Encoded as Graphics uses the languages specified in the Language selection to refine its results. However, the text content extraction is reading characters directly from the PDF, and as long as it can read those characters, it does not care what language they are from!


Conclusion

Thanks for joining us on this journey through the inner space of PDFs and the resulting options we’ve provided in PDF to Text! We’re looking forward to seeing what you can do with the tool!

