Alteryx Knowledge Base

Definitive answers from Designer experts.

Can Alteryx Parse A Word Doc Or PDF?

Alteryx Alumni (Retired)

One of the biggest reasons people love Alteryx is that it can read a very large number of different data sources. One limitation is that it cannot read a PDF or Word document without a little help from another program. Why would someone want to do this? One excellent example is parsing a folder full of resumes to search for specific text.

Why can't Alteryx read them natively? These file types are not standard data formats, so to read them we must first convert them to plain text. For the conversion there is a free, open-source program called DocToText. It can be run from the command line to convert these file types to plain text, which Alteryx can read with no issue.

I've attached an example to this post. The workflow uses an often underused tool, the Run Command tool. With its help, we can read in a list of files from a specific source folder, parse that info into something DocToText can use, then use the Run Command tool to convert all the files to plain text for further consumption. The attachment includes everything you will need (including a folder structure that works well with the module).

Download and extract the attached .yxzp file, check out the module, and let us know what you think! This example has been updated for version 10.0. You'll notice the package will produce a couple of dependency errors when you extract it. That's ok, it won't error on run.

Comments
Bolide

What about PDFs with images?

Asteroid

This is a really cool concept. A client just asked me this question, and this is a great starting point. Thanks, Chad.

This solution worked perfectly for data that is pulled from an instrument and the manufacturer only offers pdf as the output. Thanks for sharing!

Thanks for this!

But how does the EXE extract the text from PDF files?

Fantastic.

 

I was just handed a .zip file that was supposed to contain invoice data from one of our vendors, to be imported into our accounting system. Turns out, it was a folder full of 128 PDF files. In about 30 minutes of fiddling, I had all the text extracted into one file, and then after another hour or so I had parsed, structured data, one line per invoice. As the invoices are currently being manually keyed in by the accounting staff, this should help me make some quick friends.

 

One quirk of the above workflow: If you run it more than once, it appends the data to the temp files, rather than replacing them. I just added a 'Run Command' batch to 'Clear When Finished' (DEL *.pdf.txt), and now it runs perfectly, every time.
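
For anyone who prefers to do the same cleanup outside a batch file, here is a minimal Python sketch of the idea. The function name `clear_stale_outputs` is hypothetical, and it assumes the converted files all end in `.pdf.txt` and sit in a single folder, as the `DEL *.pdf.txt` command above implies.

```python
from pathlib import Path


def clear_stale_outputs(folder, pattern="*.pdf.txt"):
    """Delete leftover converted text files and return the names removed.

    Hypothetical helper: mirrors `DEL *.pdf.txt` so that a later run
    starts from empty output files instead of appending to old ones.
    """
    removed = []
    for path in sorted(Path(folder).glob(pattern)):
        if path.is_file():
            path.unlink()
            removed.append(path.name)
    return removed
```

Only files matching the pattern are touched, so the source PDFs in the same folder are left alone.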

 

Thanks! You guys rock.

Berchalyn: http://sourceforge.net/projects/doctotext/

 

A quick thank you to SilverCoders, too, makers of a variety of free, useful, open source tools.

Alteryx

I have also used the free pdftotext.exe program to do the initial conversion (I have seen it give results when DocToText does not).

Alteryx Partner

I believe this capability will be included in the next release.

Every client I've come across so far has some issue with reading semi-structured data from PDF files...

 

Best

Atom

 

Thanks for this tool and I'm looking forward to using it. I'm receiving error code 1 at the runbat.bat "run command" tool. Do you have any advice on how to solve this?

 

Thanks,

 

Brad

Alteryx

Hi Brad,

 

Try running the batch file from the command prompt. The error that you are getting is basically just saying that the batch command didn't execute properly.

 

Kane
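
If the command prompt is unfamiliar, the same check can be scripted. The sketch below is only an illustration: `run_and_report` is a hypothetical helper, and it simply runs a command and hands back the exit code together with whatever the program printed, so the message behind "error code 1" becomes visible.

```python
import subprocess


def run_and_report(command):
    """Run a command and return (exit_code, stdout, stderr) as text.

    Hypothetical helper: a non-zero exit code is what the Run Command
    tool surfaces as "error code 1"; the captured text explains why.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr
```

On Windows it could be called as `run_and_report(["cmd", "/c", "runbat.bat"])` from the folder that holds the batch file. Whether the converter writes its diagnostics to standard output or standard error is not stated in this thread, so check both.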

Atom

I'm getting the "error code 1" message for some of the PDFs that I run it on as well. When I run it directly through the command prompt, this is the error that it returns:

Using PDF parser.
Error parsing file. Backtrace:
PDF Stream iterator: invalid array
Error while parsing page number: 0
It is possible that wrong parser was selected. Trying different parsers.
Trying to detect document format by its content.
Creating parser failed.
Cannot unzip file.
Cannot unzip file.
Using PDF parser.
Error parsing file. Backtrace:
PDF Stream iterator: invalid array
Error while parsing page number: 0
Error processing file C:\Conversions\Alteryx\PDFParse\1234-abcd_efgh_stmt.pdf.

 

It only seems to happen for about a third of the PDFs that I'm trying to parse. Has anyone encountered this by chance?

Meteor

Hello,

 

Is there a way to output to a pdf?

 

Thank you.

Susana

Magnetar

Hi @Su, yes, you can use the "Render" tool to write a PDF. You will need to lay out the PDF first, using the provided reporting tools to set up tables and arrange layouts as desired. This can be as easy as one "Table" tool selecting all columns and feeding directly into the "Render" tool, but if desired there are several additional tools for laying out multiple tables, adding headers or footers, and so on.

 

Meteor

Hi @JohnJPS, thanks a lot!

Alteryx Certified Partner
Hi all. Is there a native Alteryx PDF parser yet? I'm somewhat of a novice at this and not sure if one came out with the current release, as mentioned earlier. Thanks.
Alteryx

There is not a native PDF parser at this time though hopefully the discussion and example above help!

Creative Director

If you want to see this natively supported in a future release, please submit it as an idea here:

http://community.alteryx.com/t5/Alteryx-Product-Ideas/idb-p/product-ideas

 

 

Meteoroid

In what version will this be available?

While we wait for native PDF parsing in Alteryx, another 3rd party application folks may want to try is Tabula.

 

http://tabula.technology/

 

Tabula is free and open-source. The core functionality runs in Java with a web browser front-end for user interaction. It supports both auto-detect and manual selection of table elements, extraction preview, bulk extraction, and multiple output options. I tested it with an 11 MB, 420-page PDF and was quite pleased with how well it handled the beast. Obviously with any PDF data extraction, there is a certain amount of manual effort that will be unavoidable. But I felt that Tabula did a great job of automating as much as it could, and giving me robust control of the process from there.

 

Hope this helps some!

Asteroid

I've run into the same issue as mpate. I'm not sure if it's due to the PDF being a scanned paper document. If that is the problem, is there a solution, another app I can use, or some way to bring in these PDFs and parse them out to csv or xlsx?

 

Any ideas would be welcome!

 

Thanks!

Brad

Brad,

 

If the PDF is a scanned paper document, that means that it's really just an image in a PDF wrapper. Consequently, you need to apply OCR software (Optical Character Recognition) to the document. Here are some options. (Disclaimer: I have not tested any of these approaches.)

 

I hope these resources help you build a workflow that works well for you.

Bolide

 

 

Yeah, OCR is the next best thing. Doesn't Office have a native OCR tool? Just make sure your scan is high-res or your text will be gibberish (e.g. your B's will be 8's).

 

Simon

Asteroid

Thanks for all the tips!

 

Brad

Alteryx Certified Partner

This utility strips off leading white spaces. This becomes a problem when my first column is empty. Any configuration to adjust?

This has been a frequent, ongoing issue for me. I typically go to Alteryx for BI work because I want to build completely autonomous solutions, and the problem with most of the solutions I've found in forums and blogs is that they don't work as well as I would like within Alteryx without some form of human interaction.

 

This is why I came up with a fairly simple and easily automated solution using the R tool and a relatively new R package called pdftools. The code is straightforward, and the beauty is in the simplicity. For automation on Server, the first thing you will want to do is set the working directory. The line of code is easy:

 

setwd("UNC FilePath")

 

Then add lines of code to install the packages (don't worry, so far I haven't come across issues with duplicate installs):

 

install.packages("Rcpp",  dependencies = TRUE, repos = "http://cran.us.r-project.org")

install.packages("pdftools",  dependencies = TRUE, repos = "http://cran.us.r-project.org")

 

Note: The Rcpp package is a dependency of pdftools, so installing it explicitly is not strictly necessary, but I do it to prevent issues that occur with other R GUIs.

 

Now define your data input (the file path to your PDF, found using the Directory tool):

 

data <- read.Alteryx("#1", mode="data.frame")

 

Finally change the format of your data:

      1          2         3         4      5  $   6        7
write.Alteryx(pdftools::pdf_text(file.path(data$FullPath)), 1)

 

Breakdown of the code:

1 & 7 = Alteryx specific R code that defines the output

2 = calls the package we will be using

3 = the command that will convert the pdf to text

4 = used to reformat the cell in our data frame as a file path

5 = the data frame we defined earlier

$ = the R operator that selects a field (column) from the data frame

6 = the field name of the cell from the directory tool

 

There it is: a very simple solution that lets us convert a PDF to a usable format within Alteryx.

 

Asteroid

Good morning, everyone.

I am a new Alteryx user and have been having some issues running the workflow attached by the author. After unpacking the doctotext tar file, I ran doctotext.exe from the command shell, and it executed successfully. I then downloaded the workflow, entered the path to the folder where the PDFs are stored under the directory for input data, and entered the name of the specific PDF I need to parse under file specification. After clicking the execute button, I still see an error at the Run Command tool. Am I missing a step? I would really appreciate some direction. Thank you.

 

Capture.PNG

Asteroid

I am also erroring out at the same stage that MD2050 described above. Where can we get that Runbat.bat file?

In my case I'm generating that .bat file programmatically, since it's running on several hundred PDF files at once. Here's a sample of the output, though. It might just be an issue with the formatting of your request:

 


".\exe\doctotext.exe" --pdf ".\input\XXXXX Invoices 56040-57209 0717 123.pdf" >>"XXXXX Invoices 56040-57209 0717 123.pdf.txt"
".\exe\doctotext.exe" --pdf ".\input\XXXXX Invoices 56040-57209 0717 124.pdf" >>"XXXXX Invoices 56040-57209 0717 124.pdf.txt"
".\exe\doctotext.exe" --pdf ".\input\XXXXX Invoices 56040-57209 0717 125.pdf" >>"XXXXX Invoices 56040-57209 0717 125.pdf.txt"

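A batch file in this shape can be produced by a short script. What follows is a minimal Python sketch, not the method used in the comment above: `build_batch_lines` is a hypothetical name, and the defaults simply copy the `.\exe` and `.\input` folders and the `--pdf` option visible in the sample lines.

```python
def build_batch_lines(pdf_names, exe=r".\exe\doctotext.exe", input_dir=r".\input"):
    """Return one doctotext command line per PDF file name.

    Hypothetical helper: the command shape is copied from the sample
    batch output above, with every path quoted so that file names
    containing spaces survive.
    """
    lines = []
    for name in pdf_names:
        source = f"{input_dir}\\{name}"
        # ">>" appends to an existing text file, as in the sample;
        # a single ">" would overwrite it on each run instead.
        lines.append(f'"{exe}" --pdf "{source}" >>"{name}.txt"')
    return lines


if __name__ == "__main__":
    # Print the commands; redirect this output into a .bat file to use them.
    for line in build_batch_lines(["Invoice 123.pdf", "Invoice 124.pdf"]):
        print(line)
```

Quoting each path matters here because the invoice file names contain spaces, which the command interpreter would otherwise split into separate arguments.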
Asteroid

Thanks, Daniel, for pointing that out. After seeing your post I took another look at the transformations within the workflow and could then see the bat content that is written to the file. I ran the bat file manually and got the required output.

 

I am still not sure how to fix the error in Alteryx but will figure it out.

Meteoroid

Hi All,

 

How would I adjust this for .docm?

 

Thanks

Meteor
Very cool solution to reading PDF files. I ran into an issue when I tried to replicate the results in a different folder. When I downloaded your workflow, everything went fine after I changed the directory, and I received a txt file from the sample resume you provided in the download. However, when I copied the workflow along with the exe folder to a new folder, I received an error stating the file cannot be found, even though I updated the directories in the workflow for doctotext.exe and for the PDF directory. Any thoughts?

Hi All,

 

Thanks for sharing the workflow; it is indeed great.

 

However, I am not able to run the entire workflow, as I am getting the error below.

I guess we are missing the .exe file, which is causing this issue.

It would be great if anyone could share the required .exe or another solution.

Workflow Issue.PNG

Hoping for a helping hand.

Thanks,

Jasmeen

 

 

Atom

I'm getting the following error while processing some PDFs:

 

Using PDF parser.
Error parsing file. Backtrace:
Cannot decode stream: number of filters does not match the number of decoding parameters
Error while loading stream data at offset 271 and size 1623
Error while parsing page number: 0
It is possible that wrong parser was selected. Trying different parsers.
Trying to detect document format by its content.
Creating parser failed.
Cannot unzip file.
Cannot unzip file.
Using PDF parser.
Error parsing file. Backtrace:
Cannot decode stream: number of filters does not match the number of decoding parameters
Error while loading stream data at offset 271 and size 1623
Error while parsing page number: 0
Error processing file Wo_Revised_Summary.pdf.

Alteryx

Hey there.  Just wanted to let brave souls know that I created a simple Alteryx tool using the Python SDK to parse PDF data.  If you're interested in taking it for a spin, have a look:
https://community.alteryx.com/t5/Alteryx-Server-Discussions/Alteryx-PDF-to-Text-Tool-Beta/m-p/301296...

Thanks all.

This is really nice. I have saved many hours using the attached workflow.

Thanks for sharing.

Atom

Chad, thanks for sharing this solution! Like some of the other users, I too am getting an error code 1 at the run command tool. Not sure what you mean by running in the command prompt?  Thanks much!!!

 

Jacob

Atom

@Daniel_MMI

How exactly did you add the 'Run Command' Batch ?

Can you please let me know?