Parsing Text From PDF Documents with Python Code Tool

DavidM
Alteryx

Just recently, Alteryx's one and only @ShaanM posted a brilliant write-up, How to use R and Python to Parse Word Documents.

As a next logical step to parsing Word documents, I thought about exploring the possibilities of using the Python Code tool to parse text from PDF documents.

 

Intro

This comes up quite frequently as a request from our customers and partners who are trying to unlock the value hidden in PDF documents.

Until just recently, our approach would be to use the DocToText app triggered by the Run Command Tool to parse the text from PDF documents.

 

This approach works fine but it is not always optimal.

First, it's a bit of a black box - you cannot really tweak the app. Second, it may be a problem to deploy the app behind the firewall for security reasons. Third, it is not as much fun to use as designing your own solution directly in Alteryx.

 

Solution 

We are using the Python Code tool with the pdfminer.six package to extract text from a PDF.

Once the text is extracted from your PDF, you can use the standard tools from Alteryx Designer to further analyze the text and parse it.

 

In my workflow, I just use the simple Text To Columns tool to split the one text field into rows on the "\n" delimiter.
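If you would rather do that split inside the Python tool, here is a minimal sketch of the same idea in plain Python. The function name split_text_to_rows and the column name "line" are made up for this example and are not part of the attached workflow, which uses the Designer tool.

```python
def split_text_to_rows(text, delimiter="\n"):
    """Split one text blob into a list of single-column row dicts."""
    # str.split keeps empty strings for blank lines; filter them out if unwanted
    return [{"line": part} for part in text.split(delimiter)]


rows = split_text_to_rows("Title\nFirst paragraph\n\nSecond paragraph")
for row in rows:
    print(repr(row["line"]))
```

Passing the list to pandas.DataFrame(rows) should give you one row per line, ready to be written to an output anchor.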

You can find the workflow at the bottom of this post together with the sample "foo.pdf". 

 

Note: Make sure you specify the path to your file in the Python tool. Should be easy. Shame on me, I could have wrapped this into a Macro tool like Shaan in his post. 
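A small guard can save some head-scratching when the path is wrong. The sketch below is only a suggestion; check_pdf_path is a hypothetical helper name, and it simply checks that the file exists and has a .pdf extension before you hand it to the parser.

```python
from pathlib import Path


def check_pdf_path(path):
    """Return the path as a forward-slash string, or raise if it looks wrong."""
    pdf = Path(path)
    if not pdf.is_file():
        raise FileNotFoundError(f"PDF not found: {path}")
    if pdf.suffix.lower() != ".pdf":
        raise ValueError(f"Not a .pdf file: {path}")
    # as_posix() renders the path with forward slashes
    return pdf.as_posix()
```

You could then wrap the path you pass to the conversion function, for example convert_pdf_to_txt(check_pdf_path(your_path)).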

Also, the workflow may lose the code in the Python Code tool once you open it on your PC. This seems to be a bug in the 2018.3 version of the Code tool...

 

If you have any problems opening it, just let me know and I can send you the workflow in a ZIP file, which seems to solve the problem.

Or you can just use the code from within the post to copy paste it into your tool.

 


 

The code used in the Python tool

 

#import the Alteryx package
from ayx import Alteryx

#Run this once only to install the package
Alteryx.installPackages("pdfminer.six")
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
import os 
import pandas

#function that converts PDF to text
#optional parameter PAGES can restrict which pages to process (page numbers are zero-based)
def convert_pdf_to_txt(path, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)
    
    #Instantiate the PDFminer objects
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    
    #open the file for reading in binary mode; it is closed automatically
    with open(path, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums):     #iterate with the pdf interpreter through the pdf pages
            interpreter.process_page(page)

    #Close the converter and collect the extracted text
    converter.close()
    text = output.getvalue()
    output.close()

    return text   #return the text as str

text = convert_pdf_to_txt('//Mac/Google Drive/__Alteryx/foo.pdf') #call the function for the file specified here

df = pandas.DataFrame({"text":[text]}) #convert the text str to a pandas DataFrame

df

Alteryx.write(df,1) #Write to output 1 from the tool
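One detail about the optional pages argument: as far as I can tell, pdfminer counts pages from zero, so page 1 of your document is page number 0. A tiny helper like the one below lets you keep thinking in 1-based page numbers; to_zero_based is a name I made up for this sketch, not something from the package.

```python
def to_zero_based(pages):
    """Convert human-friendly 1-based page numbers to zero-based ones."""
    zero_based = set()
    for page in pages:
        if page < 1:
            raise ValueError(f"Page numbers start at 1, got {page}")
        zero_based.add(page - 1)
    return zero_based


# First and third page of the document
print(sorted(to_zero_based([1, 3])))  # prints [0, 2]
```

With that in place, a call such as convert_pdf_to_txt(path, pages=to_zero_based([1, 3])) would process only the first and third pages.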


 

PDFMiner Package

For those of you not familiar with the package - PDFMiner is a tool for extracting information from PDF documents. More on the package on GitHub.

 

Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows you to obtain the exact location of texts in a page, as well as other information such as fonts or lines.  

It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It also has an extensible PDF parser that can be used for purposes other than text analysis.
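To give a feel for the location information mentioned above, here is a rough sketch built on extract_pages from pdfminer.high_level in pdfminer.six. The helper names text_boxes and pdf_text_boxes are my own, and I am assuming that checking for a get_text method is a good enough way to tell text elements from figures and lines. Treat it as a starting point, not as the way the package is meant to be used.

```python
def text_boxes(layout_elements):
    """Collect (x0, y0, x1, y1, text) for every element that carries text."""
    boxes = []
    for element in layout_elements:
        # Assumption: text elements expose get_text() and a bbox tuple;
        # anything without get_text() (figures, lines, ...) is skipped
        if hasattr(element, "get_text"):
            x0, y0, x1, y1 = element.bbox
            boxes.append((x0, y0, x1, y1, element.get_text().strip()))
    return boxes


def pdf_text_boxes(path):
    """Yield (page_number, boxes) per page; needs pdfminer.six installed."""
    # Imported here so the helper above stays usable without the package
    from pdfminer.high_level import extract_pages

    for page_number, page_layout in enumerate(extract_pages(path)):
        yield page_number, text_boxes(page_layout)
```

Looping over pdf_text_boxes with your own file path should give you one list of boxes per page, which you could flatten into a DataFrame with coordinate columns.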

 

Closing notes

Using the Python Code tool in Alteryx Designer ever since its release in 2018.3 has been great fun and a great learning opportunity for me.

There are countless ways in which our community can extend the functionality of Alteryx Designer. This post supports just that claim.

PDFMiner can actually do a lot more than just convert PDF to text. This post does not explore those possibilities, but it may be worth doing in the future.

 

Textract alternative (aka I lost my sleep over this)

There are other packages like textract for Python that abstract away more of what the input document is. It can parse tons of file types like csv, doc, docx, epub, html, odt, jpeg, png, pdf, tiff, xls, and many others. I was originally trying to go in this direction, as it would generalize the solution to many more file types, but I lost sleep over the dependencies it uses. Setting up the environment on Linux or macOS is kind of OK (using apt-get or brew), but on Windows it seems to be undocumented and generally something that would send my blood pressure off the charts (again).

 

Next Challenge

Take that as a challenge for next time: try to use textract 🙂 Anyone wanting to beat us to it? Make sure you let us know. That would be huge!

 

Cheers everyone,

David Matyas
Sales Engineer
Alteryx
22 REPLIES
DavidM
Alteryx

Hi @fgilbonio,

 

I think that your path is incorrect and needs to be typed with / instead of \ in your Python code.

 

//Mac/Some Folder/Alteryx/foo.pdf

One of the approaches you can use to read the path of file supplied as connection #1 to your Python tool could look like this:

 

 

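#Import the Alteryx package (skip if already imported in the tool)
from ayx import Alteryx

#Read the incoming data from connection #1
df = Alteryx.read("#1")

#Take the file path from the first column and swap \ for /
path = "" #Placeholder for the file path

for index, row in df.iterrows():
    path = row.iloc[0].replace("\\","/") #with several rows, the last one wins

print(path)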

 

 

The path variable will then hold the file path with your issue fixed, i.e. with \ replaced by /, which avoids backslash escape problems in Python strings.

 

Just supply the path variable to the code where originally I used a constant value.
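If your input carries several file paths, one row each, the same idea extends to a loop. The sketch below is only an illustration: normalize_path and convert_all are hypothetical helper names, and the converter argument stands in for whatever function you use to turn a PDF into text.

```python
def normalize_path(path):
    """Replace Windows backslashes with forward slashes."""
    return path.replace("\\", "/")


def convert_all(paths, converter):
    """Run `converter` on every path and return one result dict per file."""
    results = []
    for raw_path in paths:
        path = normalize_path(raw_path)
        try:
            results.append({"path": path, "text": converter(path), "error": ""})
        except Exception as exc:  # keep going if a single file fails
            results.append({"path": path, "text": "", "error": str(exc)})
    return results
```

Inside the Python tool, something like pandas.DataFrame(convert_all(df.iloc[:, 0], convert_pdf_to_txt)) should give you one row per file, with any failure message kept in the error column.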

 

If you want to check out how to build that workflow, I used the control param + text tool for instance in this article

https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Reading-Multiple-Files-Of-Your-Choice-...

rather than using ACTION tools.

 

Note: Sorry, can't open your zips for security purposes.

David Matyas
Sales Engineer
Alteryx
Idyllic_Data_Geek
8 - Asteroid

@DavidM Can you please show a screenshot of how to connect to the folder containing PDF files from within the Python tool?

Idyllic_Data_Geek
8 - Asteroid

How do I install the Rcpp and pdftools packages on a company laptop without administrator rights?
