PDFs hold tons of valuable information that we’d like to set free using the power of Alteryx! And they are so ubiquitous that they feel familiar and easy. But when the Alteryx Intelligence Suite team sat down to design our new PDF to Text tool, we realized there was a lot more to the Portable Document Format than meets the eye. That complexity shaped the choices we made as we designed the new tool. We hope pulling back the curtain on that process will be interesting and helpful as you start using the tool!
Fundamentally, a PDF is a file created following the rules in the Portable Document Format. The PDF specification was first introduced by Adobe in 1993 and was released as an open standard managed by the International Organization for Standardization (ISO) in 2008. The current version of the ISO standard for PDFs is almost 1000 pages long, and between the original introduction and the current standard, there have been several intermediate specifications. These standards have, in turn, been implemented by many different PDF writing programs that made different choices in how to apply the specifications. The result of this evolution over time and the flexibility of the 1000-page standard:
Two identical-looking PDFs can have very different internal structures and content.
If you’ve ever opened a PDF in a text editor to look for the text and other elements that you see in a PDF viewer, you were probably surprised by how little of it is readable.
That being said, any given PDF file may contain some of the following elements:
When it comes specifically to text, there is a spectrum of approaches to creating PDFs that made it more complicated for us to design a good PDF text extraction tool:
| Common PDF Creation Techniques | Implications for Text Storage and Extraction |
| --- | --- |
| Taking a picture or scanning a document | Text is stored as bitmap graphics and requires Optical Character Recognition (OCR) to extract text |
| Using OCR to overlay transparent text on top of a scanned or photo-based document | Text appears twice in the document - once as bitmap graphics in the image, and again as an invisible text content overlay to support copy-pasting and searching |
| Optimizing PDF size by converting characters in a non-typical font into vector graphics (drawings of the letters) instead of embedding the whole font in the document | Text is stored as vector graphics and requires OCR to extract text |
| Combining pictures of text, drawings of text, and text content on a single page | Text is stored as bitmap graphics, vector graphics, and text content, so extracting all the words requires both reading the text content and applying OCR to the text stored as bitmap and vector graphics |
| Writing a digital “True PDF” document with all text stored as text content | Huzzah! Text content extraction will retrieve all the text in this document! (Unless there are words embedded in images like logos or diagrams or pictures.) |
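The rows above boil down to a small decision rule: read the text content if there is any, and apply OCR if any words live in bitmap or vector graphics. Here is a minimal Python sketch of that rule. The names `PageContent` and `extraction_plan` are hypothetical, made up for illustration, and are not part of the PDF to Text tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageContent:
    """Hypothetical summary of how text is stored on one PDF page."""
    has_text_content: bool  # real text objects that can be read directly
    has_bitmap_text: bool   # words inside scanned or photographed images
    has_vector_text: bool   # letters drawn as vector graphics


def extraction_plan(page: PageContent) -> List[str]:
    """Return the steps needed to recover all the words on a page."""
    steps = []
    if page.has_text_content:
        steps.append("read text content")
    # Text stored as graphics (bitmap or vector) can only be recovered with OCR.
    if page.has_bitmap_text or page.has_vector_text:
        steps.append("apply OCR")
    return steps


# A plain scan needs OCR only; a mixed page needs both steps.
print(extraction_plan(PageContent(False, True, False)))
print(extraction_plan(PageContent(True, True, True)))
```

Note that a scan with an OCR text overlay has both text content and bitmap text, so running both steps on that kind of page may return the same words twice.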
In 2020, Alteryx Intelligence Suite was launched with tools designed to extract data from PDFs. In our original approach, we first convert all PDFs to images using Image Input. Then we apply OCR to the image of each page using Image to Text. This is great because it always works, regardless of variability in how the PDF was created!
However, even an excellent OCR model applied to the most pristine images of text is only about 97% accurate. Which is also great! But a page of text can hold hundreds of characters, so small inaccuracies accumulate. (Also, the OCR models can be a bit slow.) Since at least some PDFs have text content that can be read directly (quickly, and with near 100% accuracy in most cases!), we started to wonder if there might be a way to bring that text content into Alteryx.
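To get a feel for how those small inaccuracies add up, here is a rough back-of-the-envelope sketch in Python. It assumes the ~97% figure is a per-character accuracy and that errors on different characters are independent. Both are simplifying assumptions for illustration, not a description of how any particular OCR model behaves.

```python
def expected_ocr_errors(num_chars: int, char_accuracy: float = 0.97) -> float:
    """Average number of misread characters on a page (assumed per-character accuracy)."""
    return num_chars * (1.0 - char_accuracy)


def prob_error_free(num_chars: int, char_accuracy: float = 0.97) -> float:
    """Chance that every character is read correctly, assuming independent errors."""
    return char_accuracy ** num_chars


for n in (100, 500, 2000):
    errors = expected_ocr_errors(n)
    clean = prob_error_free(n)
    print(f"{n} characters: ~{errors:.0f} expected errors, "
          f"{clean:.2%} chance of a perfect page")
```

Under these assumptions, a 500-character page averages about 15 misread characters, and a perfectly read page becomes very unlikely as the page gets longer.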
Enter: PDF to Text! Our initial goal with PDF to Text was just to extract the text content from PDF documents. Then we met the invoice below:
This is a real invoice that Alteryx was sent by one of our vendors (although all the names and numbers have been anonymized for everyone’s privacy). Text content alone will get us about half the text on this page, but the rest of the text is stored as graphic content. And depending on the use case, the text content might contain everything we need, or... it might not.
So we realized we needed to do a few things:
We developed those thresholds by looking at a representative set of documents, but you can calibrate your own risk levels using the raw word counts and images of page graphics for your documents and assign those risk levels using a Formula tool. You can also use the Risk level or the Graphic Text Word Count to filter your pages downstream into different processing workflows.
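As an illustration of what a custom calibration could look like, here is a minimal Python sketch that turns raw word counts into a risk label. The function name, the labels, and the cutoff values (`low_max`, `medium_max`) are all hypothetical and are not the thresholds the tool uses. It also assumes the graphic text word count covers only words stored as graphics. In a workflow, the same logic would be written as a conditional expression in a Formula tool.

```python
def risk_level(text_word_count: int,
               graphic_text_word_count: int,
               low_max: float = 0.05,
               medium_max: float = 0.25) -> str:
    """Label a page by the share of its words stored as graphics.

    The cutoffs are made-up example values; calibrate them on your own documents.
    """
    total = text_word_count + graphic_text_word_count
    if total == 0 or graphic_text_word_count == 0:
        return "None"  # nothing would be lost by reading text content only
    share = graphic_text_word_count / total
    if share <= low_max:
        return "Low"
    if share <= medium_max:
        return "Medium"
    return "High"


# Hypothetical pages as (text word count, graphic text word count).
pages = [(400, 0), (380, 10), (200, 190)]
for text_words, graphic_words in pages:
    print(text_words, graphic_words, risk_level(text_words, graphic_words))
```

Pages labeled "None" or "Low" could then be routed to a fast text-content-only branch, while the rest go to a branch that applies OCR.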
Combining the Read Text Content Only option with the Risk Score for Text Encoded as Graphics option is not significantly faster than the Read Text and Image Content option, as both are reading in text content and applying OCR to each page. This combination does, however, give users the opportunity to explore what risks they would be taking if they implemented Read Text Content Only without the risk score in exchange for the speed improvements that come with dispensing with the OCR.
Thanks for joining us on this journey through the inner space of PDFs and the resulting options we’ve provided in PDF to Text! We’re looking forward to seeing what you can do with the tool!
To find additional resources on the AIS tools, click here: