
Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Parsing Text From PDF Documents with Python Code Tool

Alteryx

Just recently, Alteryx's one and only @ShaanM posted a brilliant write-up, "How to use R and Python to Parse Word Documents".

As a next logical step to parsing Word documents, I thought about exploring the possibilities of using the Python Code tool to parse text from PDF documents.

 

Intro

This comes up quite frequently as a request from our customers and partners who are trying to unlock the value hidden in PDF documents.

Until just recently, our approach would be to use the DocToText app triggered by the Run Command Tool to parse the text from PDF documents.

 

This approach works fine but it is not always optimal.

First, it's a bit of a black box - you cannot really tweak the app. Second, it may be a problem to deploy the app behind the firewall for security reasons. Third, it is not as much fun as designing your own solution directly in Alteryx.

 

Solution 

We are using a Python Code tool with the pdfminer.six package to extract text from a PDF.

Once the text is extracted from your PDF, you can use the standard tools from Alteryx Designer to further analyze the text and parse it.

 

In my workflow, I just use the simple Text To Columns tool to split the single text field into rows on the "\n" delimiter.
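
For anyone who would rather do the same split inside the Python tool, here is a minimal sketch in plain Python. The function name text_to_rows and the line_no field are made up for illustration, and dropping blank lines is a choice made in this sketch, not something the Text To Columns tool does.

```python
def text_to_rows(text):
    """Split extracted text into one record per non-empty line."""
    rows = []
    # splitlines() breaks on "\n" (and other line boundaries) without
    # leaving a trailing empty item the way text.split("\n") can.
    for line_no, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if stripped:  # assumption: blank lines are not interesting downstream
            rows.append({"line_no": line_no, "text": stripped})
    return rows


sample = "Title\n\nFirst paragraph\nSecond paragraph\n"
rows = text_to_rows(sample)
for row in rows:
    print(row["line_no"], row["text"])
```

The resulting list of dicts can be handed to pandas.DataFrame(rows) and then to Alteryx.write, in the same way the single-cell DataFrame is written out in the post.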

You can find the workflow at the bottom of this post together with the sample "foo.pdf". 

 

Note: Make sure you specify the path to your file in the Python tool. Should be easy. Shame on me, I could have wrapped this into a Macro tool like Shaan in his post. 

Also, the workflow may lose the code in the Python Code tool once you open it on your PC. This seems to be a bug in the 2018.3 version of the Code tool...

 

If you have any problems opening it, just let me know and I can send you a workflow in ZIP file which seems to solve the problem.

Or you can just use the code from within the post to copy paste it into your tool.

 


 

The code used in the Python tool

 

#import the Alteryx package
from ayx import Alteryx

#Run this once only to install the package
Alteryx.installPackages("pdfminer.six")
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
import os 
import pandas

#function that converts PDF to text
#optional parameter PAGES can restrict which pages to process
def convert_pdf_to_txt(path, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)
    
    #Instantiate the PDFminer objects
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    
    infile = open(path, 'rb') #open the file for read in binary mode
    
    for page in PDFPage.get_pages(infile, pagenums):     #iterate with the pdf interpreter through the pdf pages
        interpreter.process_page(page)
    
    #Close the files and converters
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()
    
    return text   #return the text as str

text = convert_pdf_to_txt('//Mac/Google Drive/__Alteryx/foo.pdf') #call the function for the file specified here

df = pandas.DataFrame({"text":[text]}) #wrap the text str in a one-row pandas DataFrame

df

Alteryx.write(df,1) #Write to output 1 from the tool
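
As a possible variation on the function above, the sketch below uses `with` blocks so the file and buffer are closed even if a page fails to parse. It is not the code from the attached workflow. The names pdf_to_text and zero_based_pages are hypothetical, and the 1-based page numbering is a convenience added here, on the understanding that PDFPage.get_pages treats page numbers as zero-based.

```python
from io import StringIO


def zero_based_pages(pages):
    """Convert 1-based page numbers into the zero-based set passed to
    PDFPage.get_pages. An empty or missing list means all pages."""
    if not pages:
        return set()
    return {p - 1 for p in pages if p >= 1}


def pdf_to_text(path, pages=None):
    # pdfminer.six is imported inside the function so the helper above
    # can be used even where the package has not been installed yet.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    manager = PDFResourceManager()
    # Both the text buffer and the PDF file are closed on exit,
    # including when process_page raises.
    with StringIO() as output, open(path, "rb") as infile:
        converter = TextConverter(manager, output, laparams=LAParams())
        try:
            interpreter = PDFPageInterpreter(manager, converter)
            for page in PDFPage.get_pages(infile, zero_based_pages(pages)):
                interpreter.process_page(page)
        finally:
            converter.close()
        return output.getvalue()
```

With this variant, pdf_to_text(path, pages=[1, 3]) would read the first and third pages only, while the original convert_pdf_to_txt passes the numbers through unchanged.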


 

PDFMiner Package

For those of you not familiar with the package - PDFMiner is a tool for extracting information from PDF documents. More on the package on GitHub.

 

Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows you to obtain the exact location of texts in a page, as well as other information such as fonts or lines.  

It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.

 

Closing notes

Using the Python Code tool in Alteryx Designer ever since its release in 2018.3 has been great fun and a great learning opportunity for me.

There are millions of ways in which our community can extend the functionality of Alteryx Designer. This post supports just that claim.

PDFminer can actually do a lot more beyond just converting PDF to TEXT. This post does not explore those possibilities but it may be worth doing in the future.

 

Textract alternative (aka I lost my sleep over this)

There are other packages, like textract for Python, that abstract away more of the input document type. It can parse many file types, such as csv, doc, docx, epub, html, odt, jpeg, png, pdf, tiff, xls, and many others. I originally tried to go in this direction, since it would generalize the solution to many more file types, but I lost sleep over its dependencies. Setting up the environment on Linux or macOS is kind of OK (using apt-get or brew), but on Windows it seems to be undocumented and generally something that would send my blood pressure off the charts (again).

 

Next Challenge

Take that as a challenge for next time: try to use textract :-) Anyone want to beat us to it? Make sure you let us know. That would be huge!

 

Cheers everyone,

David Matyas
Sales Engineer
Alteryx
Alteryx

Brilliant write up @DavidM

 

I did something similar a while back using R to parse PDFs:

 

https://community.alteryx.com/t5/Alteryx-Knowledge-Base/PDF-Parsing-in-Alteryx-using-R/ta-p/82627

 

David's solution using Python takes it to another level and will be a lot more elegant once all the prerequisite components are set up.

 

Great work

 

 

Alteryx Certified Partner

Great write up @DavidM!  Only thing is I continually get this error:

Python (2) [NbConvertApp] Converting notebook C:\ProgramData\Alteryx\Engine\9f1aa7d6-30fa-4f70-b9ff-cd83d81a3f08\2\workbook.ipynb to html ¶[NbConvertApp] Executing notebook with kernel: python3 ¶[NbConvertApp] ERROR | Error while converting 'C:\ProgramData\Alteryx\Engine\9f1aa7d6-30fa-4f70-b9ff-cd83d81a3f08\2\workbook.ipynb' ¶Traceback (most recent call last): ¶ File "c:\program files\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\nbconvert\nbconvertapp.py", line 393, in export_single_notebook ¶ output, resources = self.exporter.from_filename(notebook_filename, resources=resources) ¶ File "c:\program files\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\nbconvert\exporters\exporter.py", line 174, in from_filename ¶ return self.from_file(f, resources=resources, **kw) ¶ File "c:\program files\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\nbconvert\exporters\exporter.py", line 192, in from_file ¶ return self.from_notebook_node(nbformat.read(file_stream, as_version=4), resources=resources, **kw) ¶ File "c:\program files\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\nbconvert\exporters\html.py", line 85, in from_notebook_node ¶ return super(HTMLExporter, self).from_notebook_node(nb, resources, **kw) ¶ File "c:\program files\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\nbconvert\exporters\templateexporter.py", line 280, in from_notebook_node ¶ nb_copy, resources = super(TemplateExporter, self).from_notebook_node(nb, resources, **kw) ¶ File "c:\program files\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\nbconvert\exporters\exporter.py", line 134, in from_notebook_node ¶ nb_copy, resources = self._preprocess(nb_copy, resources) ¶ File "c:\program files\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\nbconvert\exporters\exporter.py", line 311, in _preprocess ¶ nbc, resc = preprocessor(nbc, resc) ¶ File "c:\program 
files\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\nbconvert\preprocessors\base.py", line 47, in __call__ ¶ return self.preprocess(nb, resources) ¶ File "c:\program files\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\nbconvert\preprocessors\execute.py", line 262, in preprocess ¶ nb, resources = super(ExecutePreprocessor, self).preprocess(nb, resources) ¶ File "c:\program files\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\nbconvert\preprocessors\base.py", line 69, in preprocess ¶ nb.cells[index], resources = self.preprocess_cell(cell, resources, index) ¶ File "c:\program files\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\nbconvert\preprocessors\execute.py", line 286, in preprocess_cell ¶ raise CellExecutionError.from_cell_and_msg(cell, out) ¶nbconvert.preprocessors.execute.CellExecutionError: An error occurred while executing the following cell: ¶------------------ ¶from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter ¶from pdfminer.converter import TextConverter ¶from pdfminer.layout import LAParams ¶from pdfminer.pdfpage import PDFPage ¶from io import StringIO ¶import os ¶import pandas ¶ ¶#function that converts PDF to text ¶#optional parameter PAGES can restrict which pages to process ¶def convert_pdf_to_txt(path, pages=None): ¶ if not pages: ¶ pagenums = set() ¶ else: ¶ pagenums = set(pages) ¶ ¶ #Instantiate the PDFminer objects ¶ output = StringIO() ¶ manager = PDFResourceManager() ¶ converter = TextConverter(manager, output, laparams=LAParams()) ¶ interpreter = PDFPageInterpreter(manager, converter) ¶ ¶ infile = open(path, 'rb') #open the file for read in binary mode ¶ ¶ for page in PDFPage.get_pages(infile, pagenums): #iterate with the pdf interpreter through the pdf pages ¶ interpreter.process_page(page) ¶ ¶ #Close the files and converters ¶ infile.close() ¶ converter.close() ¶ text = output.getvalue() ¶ output.close() ¶ ¶ return text #return the text as str ¶ ¶text = 
convert_pdf_to_txt('C:\Users\Chad\OneDrive - Data Prep U, LLC\YouTube Videos\Alteryx - PDF Parse\foo.pdf') #call the function for the file specified here ¶ ¶df = pandas.DataFrame({"text":[text]}) #convert the TEXT str to Panda's DF ¶ ¶df ¶ ¶Alteryx.write(df,1) #Write to output 1 from the tool ¶------------------ ¶ ¶ File "<ipython-input-3-4edddfce9acb>", line 36 ¶ text = convert_pdf_to_txt('C:\Users\Chad\OneDrive - Data Prep U, LLC\YouTube Videos\Alteryx - PDF Parse\foo.pdf') #call the function for the file specified here ¶ ^ ¶SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape ¶ ¶SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape (<ipython-input-3-4edddfce9acb>, line 36) ¶ ¶

 

I am using this with the foo.pdf example you also sent.  The Python package appears to have installed properly, any ideas? 

Alteryx

Hey @DataPrepChad,

 

Thanks!

 

That is not very readable :-) Can you try to run the code interactively with the RUN button inside the Python Code tool, rather than the RUN button of the workflow, so we can trace which line causes the error?

 

Try removing lines until you get it working, then add them back in one at a time.

 

Also, you can install Anaconda Navigator and try the code in a Jupyter notebook, just to test whether the code works for you outside Alteryx.

 

dm

David Matyas
Sales Engineer
Alteryx
Alteryx Certified Partner

Ahh, good call @DavidM (wasn't aware of the built-in run).  Checking this out:

OSError                                   Traceback (most recent call last)
<ipython-input-5-9c785d0e5b5c> in <module>()
     34     return text   #return the text as str
     35 
---> 36 text = convert_pdf_to_txt('C:\Temp\foo.pdf') #call the function for the file specified here
     37 
     38 df = pandas.DataFrame({"text":[text]}) #convert the TEXT str to Panda's DF

<ipython-input-5-9c785d0e5b5c> in convert_pdf_to_txt(path, pages)
     21     interpreter = PDFPageInterpreter(manager, converter)
     22 
---> 23     infile = open(path, 'rb') #open the file for read in binary mode
     24 
     25     for page in PDFPage.get_pages(infile, pagenums):     #iterate with the pdf interpreter through the pdf pages

OSError: [Errno 22] Invalid argument: 'C:\\Temp\x0coo.pdf'

Interesting that it says 'invalid argument' for a file 'x0coo.pdf'.  I tried to look through the code to see if there was anything else I missed for calling the sample file, but didn't see anything.   

Meteoroid

I am getting the same error message as @DataPrepChad in a similar situation. Everything looks to be set up correctly.

Meteoroid

Saw your suggestion @DavidM and found the error inside the Python tool.  I had a Unicode error, once I escaped all my slashes it worked like a charm. 

 

@DataPrepChad, looking at your error, it would seem to be a similar situation. \U starts an eight-character Unicode escape. Just double all of the backslashes in your folder path and it should work.
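
To illustrate the escaping issue, here is a small sketch using only the standard library. The path C:\Temp\foo.pdf is taken from the traceback earlier in the thread, and the variable names are just for illustration. It shows why '\f' in a normal string literal turns "foo.pdf" into "\x0coo.pdf", along with three spellings of the path that avoid the problem.

```python
from pathlib import PureWindowsPath

# In a normal string literal, "\f" is ONE character: a form feed (0x0c).
# That is how 'C:\Temp\foo.pdf' ends up as 'C:\\Temp\x0coo.pdf'.
assert len("\f") == 1
assert "\f" == "\x0c"

# Three ways to write the same Windows path safely:
doubled = "C:\\Temp\\foo.pdf"   # every backslash doubled
raw = r"C:\Temp\foo.pdf"        # raw string: backslashes are kept literally
forward = "C:/Temp/foo.pdf"     # forward slashes

# The doubled and raw spellings are the identical string.
assert doubled == raw

# pathlib treats the forward-slash spelling as the same Windows path.
assert PureWindowsPath(forward) == PureWindowsPath(raw)
print(PureWindowsPath(raw).as_posix())
```

One caveat with raw strings: a raw literal cannot end in a single backslash, so a folder path with a trailing backslash still needs the doubled or forward-slash form.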

Alteryx Certified Partner

@NBart great find!  I took a different approach and switched to forward slashes, as in @DavidM's example (//mac/etc...).

 

One item of note, and I can't replicate it (but am trying) is that at one point I clicked out of the tool and back in and all the code was gone.  Will keep trying to replicate, but otherwise this is great!

Alteryx

Guys, also try the parsing of PDF tabular data :-)

 

https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Extracting-Tabular-Data-from-PDF-Docum...

David Matyas
Sales Engineer
Alteryx
Meteoroid

@DavidM Just skimmed your summary and realized I need to dive into that one after lunch.  It looks very promising for some high priority use cases we are solving with a very expensive alternative.
