
Alteryx Designer Desktop Knowledge Base

Definitive answers from Designer Desktop experts.

Can Alteryx Parse A Word Doc Or PDF?

ChadM
Alteryx Alumni (Retired)

One of the biggest reasons people love Alteryx is that it can read a very large number of different data sources. One limitation is that it cannot read a PDF or Word document without a little help from another source. Why would someone want to do this? One excellent example is parsing a folder full of resumes to search for specific text.

Why can't Alteryx read them natively? These file types are not standard data formats, so in order to read them, we must first convert them to a plain text file.  To convert, there is a free, open-source program called DocToText. This program can be run at the command line to convert these file types to plain text, which Alteryx can read with no issue.

I've included an example attached to this post.  This workflow utilizes an often underused tool, the Run Command tool.  With the help of this tool, we can read in a list of files from a specific source folder, parse the info into something DocToText can use, then use the RunCmd Tool to convert all files to plain text for further consumption.  I've included everything you will need in the attachment (including a folder structure that works well with the module). 

Download and extract the attached .yxzp file, check out the module, and let us know what you think! This example has been updated for version 10.0. You'll notice the package will produce a couple of dependency errors when you extract it. That's ok, it won't error on run.

Comments
simon
11 - Bolide

What about PDFs with images?

levell_x_dunn
10 - Fireball

This is a really cool concept. A client just asked me this question, and this is a great starting point. Thanks, Chad.

_brent_smith
5 - Atom

This solution worked perfectly for data that is pulled from an instrument and the manufacturer only offers pdf as the output. Thanks for sharing!

Berchalyn
5 - Atom

Thanks for this!

But how does the EXE work to get the TXT from the PDF files?

daniel_mmi
9 - Comet

Fantastic.

 

I was just handed a .zip file that was supposed to contain invoice data from one of our vendors, to be imported into our accounting system. Turns out, it was a folder full of 128 PDF files. In about 30 minutes of fiddling, I had all the text extracted into one file, and then after another hour or so I had parsed, structured data, one line per invoice. As the invoices are currently being manually keyed in by the accounting staff, this should help me make some quick friends.

 

One quirk of the above workflow: If you run it more than once, it appends the data to the temp files, rather than replacing them. I just added a 'Run Command' batch to 'Clear When Finished' (DEL *.pdf.txt), and now it runs perfectly, every time.
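The cleanup step described above can also be sketched outside of a batch file. This is a minimal Python illustration, not part of the attached workflow: the function name clear_stale_outputs is hypothetical, and it assumes the converted files sit in one folder and end in .pdf.txt, matching the DEL *.pdf.txt pattern mentioned above.

```python
from pathlib import Path


def clear_stale_outputs(folder):
    """Delete leftover *.pdf.txt files so a re-run starts from empty outputs.

    Hypothetical helper: mirrors the 'DEL *.pdf.txt' batch command.
    Returns the names of the files that were removed, in sorted order.
    """
    removed = []
    for path in sorted(Path(folder).glob("*.pdf.txt")):
        path.unlink()  # remove the stale text output
        removed.append(path.name)
    return removed
```

Running something like this before the conversion (rather than after) has the same effect: an appending redirect then always starts from a missing file.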

 

Thanks! You guys rock.

daniel_mmi
9 - Comet

Berchalyn: http://sourceforge.net/projects/doctotext/

 

A quick thank you to SilverCoders, too, makers of a variety of free, useful, open source tools.

CailinS
Alteryx

I have also used the free pdftotext.exe program to do the initial conversion (I have seen it give results when DocToText does not).

Atabarezz
13 - Pulsar

I believe this capability will be included in the next release.

Every client I've come across so far has some issue with reading semi-structured data from PDF files...

 

Best

ChiBK
5 - Atom

 

Thanks for this tool and I'm looking forward to using it. I'm receiving error code 1 at the runbat.bat "run command" tool. Do you have any advice on how to solve this?

 

Thanks,

 

Brad

KaneG
Alteryx Alumni (Retired)

Hi Brad,

 

Try running the batch file from the command prompt. The error that you are getting is basically just saying that the batch command didn't execute properly.
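Running the command outside of Alteryx matters because the Run Command tool only reports the exit code, while the converter's own messages go to the console. As a rough sketch of the same idea in Python (the helper name run_and_report is hypothetical, and the runbat.bat call in the comment assumes the batch file is in the current folder):

```python
import subprocess


def run_and_report(command):
    """Run a command and return (exit_code, stdout, stderr).

    An exit code other than 0 is what Alteryx surfaces as 'error code N';
    the stderr text usually explains why the command failed.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


# Assumed usage on Windows, with runbat.bat in the current folder:
# code, out, err = run_and_report(["cmd", "/c", "runbat.bat"])
# print(code, err)
```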

 

Kane

mpate
5 - Atom

I'm getting the "error code 1" message for some of the PDFs that I run it on as well. When I run it directly through the command prompt, this is the error that it returns:

Using PDF parser.
Error parsing file. Backtrace:
PDF Stream iterator: invalid array
Error while parsing page number: 0
It is possible that wrong parser was selected. Trying different parsers.
Trying to detect document format by its content.
Creating parser failed.
Cannot unzip file.
Cannot unzip file.
Using PDF parser.
Error parsing file. Backtrace:
PDF Stream iterator: invalid array
Error while parsing page number: 0
Error processing file C:\Conversions\Alteryx\PDFParse\1234-abcd_efgh_stmt.pdf.

 

It only seems to happen for about a third of the PDFs that I'm trying to parse. Has anyone encountered this by chance?

Su
7 - Meteor

Hello,

 

Is there a way to output to a pdf?

 

Thank you.

Susana

JohnJPS
15 - Aurora

Hi @Su -- yes, you can use the "Render" tool to write a PDF. You will need to lay out the PDF first, using the provided reporting tools to set up tables and arrange layouts as desired. This can be as easy as one "Table" tool selecting all columns and feeding directly into the "Render" tool, but there are several additional tools for laying out multiple tables, adding headers or footers, and so on.

 

Su
7 - Meteor

Hi @JohnJPS , thanks a lot !

lrygiel
7 - Meteor
Hi all. Is there a native Alteryx PDF parser yet? I'm somewhat a novice at this and not sure if one came out with the current release as mentioned earlier. Thanks.
CailinS
Alteryx

There is not a native PDF parser at this time though hopefully the discussion and example above help!

TaraM
Alteryx Alumni (Retired)

If you want to see this natively supported in a future release, please submit as an idea here:

http://community.alteryx.com/t5/Alteryx-Product-Ideas/idb-p/product-ideas

 

 

Jordan_Carson
6 - Meteoroid

In what version will this be available?

JamiesonC
5 - Atom

While we wait for native PDF parsing in Alteryx, another 3rd party application folks may want to try is Tabula.

 

http://tabula.technology/

 

Tabula is free and open-source. The core functionality runs in Java with a web browser front-end for user interaction. It supports both auto-detect and manual selection of table elements, extraction preview, bulk extraction, and multiple output options. I tested it with an 11 MB, 420-page PDF and was quite pleased with how well it handled the beast. Obviously with any PDF data extraction, there is a certain amount of manual effort that will be unavoidable. But I felt that Tabula did a great job of automating as much as it could, and giving me robust control of the process from there.

 

Hope this helps some!

brad_j_crep
8 - Asteroid

I've run into the same issue as mpate. I'm not sure if it's due to the PDF being a scanned paper document. If this is the problem, is there a solution, another app I can use, or some other way to bring in these PDFs and parse them out to csv or xlsx?

 

Any ideas would be welcome!

 

Thanks!

Brad

JamiesonC
5 - Atom

Brad,

 

If the PDF is a scanned paper document, that means that it's really just an image in a PDF wrapper. Consequently, you need to apply OCR software (Optical Character Recognition) to the document. Here are some options. (Disclaimer: I have not tested any of these approaches.)

 

I hope these resources help you build a workflow that works well for you.

simon
11 - Bolide

 

 

Yeah, OCR is the next best thing. Doesn't Office have a native OCR tool? Just make sure your scan is high-res or your text will be gibberish (e.g. your B's will be 8's).

 

Simon

TaraM
Alteryx Alumni (Retired)
brad_j_crep
8 - Asteroid

Thanks for all the tips!

 

Brad

gnans19
11 - Bolide

This utility strips off leading white spaces. This becomes a problem when my first column is empty. Any configuration to adjust?

carpoolboy
5 - Atom

This has been a frequent, ongoing issue for me. I typically turn to Alteryx for BI work because I want to build completely autonomous solutions, and the problem I have faced with the solutions found in most forums and blogs is that they don't work as well as I would like within Alteryx without some form of human interaction.

 

This is why I came up with a fairly simple and easily automated solution using the R tool and a relatively new R package called pdftools. The code is straightforward, and the beauty is in the simplicity. For automation purposes on Server, the first thing you will want to do is set the working directory. The line of code is easy:

 

setwd("UNC FilePath")

 

Then add a line of code to install the packages (don't worry, so far I haven't come across issues with duplicate installs):

 

install.packages("Rcpp",  dependencies = TRUE, repos = "http://cran.us.r-project.org")

install.packages("pdftools",  dependencies = TRUE, repos = "http://cran.us.r-project.org")

 

Note: The Rcpp package is a dependency of pdftools, so installing it explicitly is not strictly necessary, but I do it to prevent issues that occur with other R GUIs.

 

Now define your data input (The FilePath to your pdf found using the directory tool)

 

data <- read.Alteryx("#1", mode="data.frame")

 

Finally change the format of your data:

         1                2            3           4          5  $    6            7

write.Alteryx(pdftools::pdf_text(file.path(data$FullPath)), 1)

 

Breakdown of the code:

1 & 7 = Alteryx specific R code that defines the output

2 = calls the package we will be using

3 = the command that will convert the pdf to text

4 = used to reformat the cell in our data frame as a file path

5 = the data frame we defined earlier

$ = the R operator that selects a single column (field) from the data frame

6 = the field name of the cell from the directory tool

 

There it is a very simple solution that allows us to convert pdf to a usable format with in Alteryx.

 

MD2050
8 - Asteroid

Gm Everyone-

I am a new Alteryx user and have been having some issues running the workflow attached by the author. After extracting the DocToText tar file, I ran doctotext.exe from the command shell, and it executed successfully. I then downloaded the workflow, entered the path of the folder where the PDFs are stored under the directory for input data, and entered the name of the specific PDF I need to parse under file specification. After clicking the execute button, I still see an error at the Run Command tool. Am I missing a step? I would really appreciate some direction. Thank you,

 

Capture.PNG

vkarthik21
8 - Asteroid

I am also erroring out at the same stage that MD2050 described above. Where can we get that Runbat.bat file?

daniel_mmi
9 - Comet

In my case I'm generating that .bat file programmatically, since it's running on several hundred PDF files at once. Here's a sample of the output, though. It might just be an issue with the formatting of your request:

 


".\exe\doctotext.exe" --pdf ".\input\XXXXX Invoices 56040-57209 0717 123.pdf" >>"XXXXX Invoices 56040-57209 0717 123.pdf.txt"
".\exe\doctotext.exe" --pdf ".\input\XXXXX Invoices 56040-57209 0717 124.pdf" >>"XXXXX Invoices 56040-57209 0717 124.pdf.txt"
".\exe\doctotext.exe" --pdf ".\input\XXXXX Invoices 56040-57209 0717 125.pdf" >>"XXXXX Invoices 56040-57209 0717 125.pdf.txt"
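For anyone who wants to generate those lines outside of a Formula tool, here is a minimal Python sketch of the same string building. The function name build_batch_lines and its defaults are assumptions for illustration; the command format, including the --pdf flag and the >> redirect, is copied from the sample output above rather than from DocToText documentation.

```python
def build_batch_lines(pdf_names, exe=r".\exe\doctotext.exe", input_dir=r".\input"):
    """Return one DocToText command line per PDF file name.

    Hypothetical helper: each line converts <input_dir>\\<name> and appends
    the extracted text to <name>.txt, matching the sample batch output.
    """
    return [
        f'"{exe}" --pdf "{input_dir}\\{name}" >>"{name}.txt"'
        for name in pdf_names
    ]


# Assumed usage: write the lines to a batch file for the Run Command tool.
# with open("runbat.bat", "w") as handle:
#     handle.write("\n".join(build_batch_lines(["invoice 123.pdf"])))
```

Quoting every path is what lets file names with spaces survive the trip through the command line.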

vkarthik21
8 - Asteroid

Thanks Daniel for pointing that out. Seeing your post I had a relook at the transformations within the workflow and I could then see the bat content that is getting populated in the file. I manually ran the bat file and got the output required.

 

I am still not sure how to fix the error in alteryx but will figure it out. 

Levin
6 - Meteoroid

Hi All,

 

How would I adjust this for .docm?

 

Thanks

Jchantnicki
7 - Meteor
Very cool solution to reading PDF files. I ran into an issue when I tried to replicate the results in a different folder. When I downloaded your workflow, everything went fine after I changed the directory, and I received a txt file from the sample resume you provided in the download. However, when I copied the workflow along with the exe folder to a new folder, I received an error stating that the file cannot be found, even though I updated the directories in the workflow for doctotext.exe along with the PDF directory. Any thoughts?
JasmeenCheema
5 - Atom

Hi All,

 

Thanks for sharing the workflow; it's great indeed.

 

However, I am not able to run the entire workflow because I am getting the error below.

I guess we are missing the .exe file, which is causing this issue.

It would be great if anyone could share the required .exe or another solution.

Workflow Issue.PNG

Hoping for a helping hand.

Thanks,

Jasmeen

 

 

jhasid
5 - Atom

I am getting the following error while processing some PDFs:

 

Using PDF parser.
Error parsing file. Backtrace:
Cannot decode stream: number of filters does not match the number of decoding parameters
Error while loading stream data at offset 271 and size 1623
Error while parsing page number: 0
It is possible that wrong parser was selected. Trying different parsers.
Trying to detect document format by its content.
Creating parser failed.
Cannot unzip file.
Cannot unzip file.
Using PDF parser.
Error parsing file. Backtrace:
Cannot decode stream: number of filters does not match the number of decoding parameters
Error while loading stream data at offset 271 and size 1623
Error while parsing page number: 0
Error processing file Wo_Revised_Summary.pdf.

JeremyL
Alteryx Alumni (Retired)

Hey there.  Just wanted to let brave souls know that I created a simple Alteryx tool using the Python SDK to parse PDF data.  If you're interested in taking it for a spin, have a look:
https://community.alteryx.com/t5/Alteryx-Server-Discussions/Alteryx-PDF-to-Text-Tool-Beta/m-p/301296...

Thanks all.

Sushantmutreja1
7 - Meteor

This is really nice, I have saved many hours using attached workflow.

Thanks for sharing.

Data1
5 - Atom

Chad, thanks for sharing this solution! Like some of the other users, I too am getting an error code 1 at the run command tool. Not sure what you mean by running in the command prompt?  Thanks much!!!

 

Jacob

anas_10
5 - Atom

@Daniel_MMI

How exactly did you add the 'Run Command' Batch ?

Can you please let me know?

daniel_mmi
9 - Comet

See the attached images.

 

Workflow: We're taking the first record from the incoming data. Nothing from this record will be used, just using it as a vessel for the formula tool that follows. That formula tool has the text that I need to send to the command line, in this case deleting the temp pdf and txt files. The command tool took some figuring, but what it's doing is: writing the text from that formula tool into 'clear_when_finished.bat' (the top part of the configuration screen), then executing that .bat file using the 'Run External Program' option in the second section. This method will work for anything that can be done from the command line, using those two steps of: 1) write the command to a .bat file, and 2) execute that .bat file as an 'external program'.

 

Workflow.PNG

Command.png
Batch.png

anas_10
5 - Atom

@Daniel_MMI

 

This was immensely helpful. Thanks a lot!

AlteryxUserFL
11 - Bolide

@Daniel_MMI I'm wondering how you parsed and structured the necessary data from the text for the invoices you mentioned. I was able to pull the text from PDF invoices, find PO #s, etc. However, I'm struggling to structure the less identifiable data such as vendor name. Would you mind sharing how you successfully pulled out the required information from invoices, such as vendor name, invoice date, invoice number, etc.?

CailinS
Alteryx

@AlteryxUserFL I figured I'd chime in and mention regular expressions (accessible in the Formula tools as well as through click-and-build menus in the RegEx tool). If you are already familiar with them, great, and you may have already tried this with no luck. If not, I find that whenever I have unstructured data this is often the place to start (if not simpler parsing with the Text to Columns tool first).

 

My pro-tip to answer your question about the less obvious/consistent patterns: identify the clearest patterns first, and then what falls 'in between' can have a more flexible/vague logic, like the values for vendor name! I like the site regexr.com for practicing building the patterns and testing my data. For instance, you could build out a pattern to recognize vendor name, invoice date, and invoice number like this: ((?:\w\s*)+)(\d\d\d\d-\d\d-\d\d)(\d{7}) which might represent 'some word(s) maybe followed by a space, then 4 digit year-2 digit month-2 digit day, then 7 digits'. Your patterns would obviously be different.
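To see how the three capture groups in that pattern behave, here is a small Python sketch. The pattern is the one quoted above; the function name parse_invoice_line, the field names, and the sample text are made up for illustration. In Alteryx the same pattern would go into the RegEx tool rather than into code.

```python
import re

# Pattern from the comment above: vendor words, a yyyy-mm-dd date, 7 digits.
INVOICE_PATTERN = re.compile(r"((?:\w\s*)+)(\d\d\d\d-\d\d-\d\d)(\d{7})")


def parse_invoice_line(text):
    """Return vendor, date, and invoice number, or None if nothing matches."""
    match = INVOICE_PATTERN.search(text)
    if match is None:
        return None
    vendor, invoice_date, invoice_number = match.groups()
    return {
        "vendor": vendor.strip(),  # the first group keeps trailing spaces
        "invoice_date": invoice_date,
        "invoice_number": invoice_number,
    }


# Made-up sample line in the layout the pattern expects.
print(parse_invoice_line("Acme Supply 2017-07-151234567"))
```

Note that the vendor group is the vague part: it simply takes whatever word characters precede the date, which is the 'in between' logic described above.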

daniel_mmi
9 - Comet

As someone who is *not* intimately conversant with regular expressions, I used a more heuristic approach, based on recognizing patterns within the resulting data. The first time through, for example, I noticed that Vendor Name was always on line 7. Maybe vendor phone was on line 9, and required a formula to clean up a text string like: "Invoice payable                                                                   555-123-4567" 
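The line-position heuristic described above can be sketched in a few lines of Python. Everything here is illustrative: the function name extract_fields is hypothetical, the line numbers 7 and 9 are the ones mentioned in the comment, and the phone cleanup simply assumes the number is the last whitespace-separated token on its line.

```python
def extract_fields(text, vendor_line=7, phone_line=9):
    """Pick fields by line position (1-based), as in the heuristic above.

    Assumes the text has at least phone_line lines and that the phone
    number is the last whitespace-separated token on its line.
    """
    lines = text.splitlines()
    vendor = lines[vendor_line - 1].strip()
    tokens = lines[phone_line - 1].split()
    phone = tokens[-1] if tokens else ""
    return vendor, phone
```

Each alternative invoice layout would need its own pair of line numbers, which is why this approach only stays manageable when the number of layouts is small.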

There ended up being several possible configurations, usually based on the # of lines in each invoice description, so over the course of a couple runs, I was able to iron out all the strange exceptions. Below is an example of the sorts of transformations and selections I used: 

 

Keep in mind, this was one of the first complex workflows I built, so there's a good bit of 'brute force' methodology involved. Fortunately, the scope of the incoming data was fairly modest (if not consistent), so once I had it working, it didn't require much maintenance, and any errors stood out pretty clearly. 

Canvas.PNG
Translation.png
Exceptions.png
Amt Due.png

CailinS
Alteryx

@daniel_mmi nice! looks effective and to your point, low maintenance.  

daniel_mmi
9 - Comet

@CailinS Thanks! It ran biweekly for almost a year, and by my count saved my accounting department something like 250 hours of especially dull manual labor. It's always good to have accounting on your side!

AlteryxUserFL
11 - Bolide

Thank you for the information. My data would vary a lot, as the invoices would come from thousands of vendors. Is there any way to do a reverse regular expression (kind of like machine learning), where you input the data and the desired result for several hundred lines of data and it returns the regular expression needed to pull that data?

 

NeilR
Alteryx Alumni (Retired)

@AlteryxUserFL If you had enough input data with desired result, you might be able to use machine learning to generate the desired outcome, negating the need to reverse engineer a regular expression.

AlteryxUserFL
11 - Bolide

@NeilR does Alteryx have any machine learning capabilities, or would I need a different platform?

NeilR
Alteryx Alumni (Retired)

Alteryx does have ML tools, but this particular use case would likely require some custom R or Python code. If you're able to share data it could be a fun project for the community to help you tackle. It would likely require several thousand example records to train a decent model.

AlteryxUserFL
11 - Bolide

Hello NeilR, making this a community project would be great. However, I don't think I would be able to get authorization to upload thousands of our invoices, or invoice data, to the web. I have applied for the beta program, as those new tools look like they would help me solve this problem. Basically, I plan to convert PDF invoices to text and then build new columns with the fields I want to capture from each PDF text cell. Using those new tools, I should be able to run the sample data through, and it looks like they can help build a formula for a process like this.