on 08-25-2014 11:22 PM - edited on 07-27-2021 11:47 PM by APIUserOpsDM
One of the biggest reasons people love Alteryx is its ability to read a very large number of different data sources. One limitation is that it cannot read a PDF or Word doc without a little help from another source. Why would someone want to do this? Well, one excellent example would be parsing a folder full of resumes to search for specific text.
Why can't Alteryx read them natively? These file types are not standard data formats, so in order to read them, we must first convert them to a plain text file. To convert, there is a free, open-source program called DocToText. This program can be run at the command line to convert these file types to plain text, which Alteryx can read with no issue.
I've included an example attached to this post. This workflow utilizes an often underused tool, the Run Command tool. With the help of this tool, we can read in a list of files from a specific source folder, parse the info into something DocToText can use, then use the RunCmd Tool to convert all files to plain text for further consumption. I've included everything you will need in the attachment (including a folder structure that works well with the module).
Download and extract the attached .yxzp file, check out the module, and let us know what you think! This example has been updated for version 10.0. You'll notice the package will produce a couple of dependency errors when you extract it. That's ok, it won't error on run.
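For anyone curious what the module's generated batch file looks like, here is a rough Python sketch of the same idea: list the PDFs in a folder and build one DocToText command line per file. This is illustrative only, not the attached workflow; the folder layout and the doctotext.exe path are assumptions based on the module's structure.

```python
from pathlib import Path

def build_batch(input_dir: str, exe: str = r".\exe\doctotext.exe") -> str:
    """Return the contents of a runbat.bat-style batch file:
    one DocToText command per PDF, redirecting output to <name>.pdf.txt."""
    lines = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        # Each line redirects DocToText's stdout into a sibling .txt file.
        lines.append(f'"{exe}" --pdf "{pdf}" >>"{pdf.name}.txt"')
    return "\n".join(lines)
```

In the workflow itself, the Directory tool supplies the file list, a Formula tool builds each command line, and the Run Command tool writes and executes the batch file.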
What about pdf's with images?
This is a really cool concept. I was just asked this question by a client, and this is a great starting point. Thanks, Chad!
This solution worked perfectly for data pulled from an instrument whose manufacturer only offers PDF as the output. Thanks for sharing!
Thanks for this!
But how does the EXE work in extracting the TXT from PDF files?
Fantastic.
I was just handed a .zip file that was supposed to contain invoice data from one of our vendors, to be imported into our accounting system. Turns out, it was a folder full of 128 PDF files. In about 30 minutes of fiddling, I had all the text extracted into one file, and then after another hour or so I had parsed, structured data, one line per invoice. As the invoices are currently being manually keyed in by the accounting staff, this should help me make some quick friends.
One quirk of the above workflow: If you run it more than once, it appends the data to the temp files, rather than replacing them. I just added a 'Run Command' batch to 'Clear When Finished' (DEL *.pdf.txt), and now it runs perfectly, every time.
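Daniel's 'Clear When Finished' fix can be sketched in Python terms like so. The *.pdf.txt pattern comes from his comment; the helper itself is purely illustrative, since in the workflow the cleanup is a one-line DEL command in a batch file.

```python
from pathlib import Path

def clear_when_finished(folder: str) -> int:
    """Delete leftover *.pdf.txt temp files so a rerun doesn't append
    to old output; return how many files were removed."""
    removed = 0
    for txt in Path(folder).glob("*.pdf.txt"):
        txt.unlink()
        removed += 1
    return removed
```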
Thanks! You guys rock.
Berchalyn: http://sourceforge.net/projects/doctotext/
A quick thank you to SilverCoders, too, makers of a variety of free, useful, open source tools.
I have also used the free pdftotext.exe program to do the initial conversion (I have seen it give results when DocToText does not).
I believe this capability will be included in the next release.
Every client I've come across so far has some issue with reading semi-structured data from PDF files...
Best
Thanks for this tool and I'm looking forward to using it. I'm receiving error code 1 at the runbat.bat "run command" tool. Do you have any advice on how to solve this?
Thanks,
Brad
Hi Brad,
Try running the batch file from the command prompt. The error that you are getting is basically just saying that the batch command didn't execute properly.
Kane
I'm getting the "error code 1" message for some of the PDFs that I run it on as well. When I run it directly through the command prompt, this is the error it returns:
Using PDF parser.
Error parsing file. Backtrace:
PDF Stream iterator: invalid array
Error while parsing page number: 0
It is possible that wrong parser was selected. Trying different parsers.
Trying to detect document format by its content.
Creating parser failed.
Cannot unzip file.
Cannot unzip file.
Using PDF parser.
Error parsing file. Backtrace:
PDF Stream iterator: invalid array
Error while parsing page number: 0
Error processing file C:\Conversions\Alteryx\PDFParse\1234-abcd_efgh_stmt.pdf.
It only seems to happen for about a third of the PDFs that I'm trying to parse. Has anyone encountered this by chance?
Hello,
Is there a way to output to a pdf?
Thank you.
Susana
Hi @Su -- yes, you can use the "Render" tool to write a PDF. You will need to lay out the PDF first, using the various provided reporting tools to set up tables and arrange layouts as desired. This can be as easy as one "Table" tool selecting all columns and feeding directly into the "Render" tool, but if desired there are several additional tools for laying out multiple tables, adding headers or footers, etc.
Hi @JohnJPS, thanks a lot!
There is not a native PDF parser at this time though hopefully the discussion and example above help!
If you want to see this natively supported in a future release, please submit as an idea here:
http://community.alteryx.com/t5/Alteryx-Product-Ideas/idb-p/product-ideas
Which version will this be available in?
While we wait for native PDF parsing in Alteryx, another 3rd party application folks may want to try is Tabula.
Tabula is free and open-source. The core functionality runs in Java with a web browser front-end for user interaction. It supports both auto-detect and manual selection of table elements, extraction preview, bulk extraction, and multiple output options. I tested it with an 11 MB, 420-page PDF and was quite pleased with how well it handled the beast. Obviously with any PDF data extraction, there is a certain amount of manual effort that will be unavoidable. But I felt that Tabula did a great job of automating as much as it could, and giving me robust control of the process from there.
Hope this helps some!
I've run into the same issue as mpate. I'm not sure if it's because the PDF is a scanned paper document. If that is the problem, is there a solution, another app I can use, or some way I can work it to bring in these PDFs and parse them out to CSV or XLSX?
Any ideas would be welcome!
Thanks!
Brad
Brad,
If the PDF is a scanned paper document, that means that it's really just an image in a PDF wrapper. Consequently, you need to apply OCR software (Optical Character Recognition) to the document. Here are some options. (Disclaimer: I have not tested any of these approaches.)
I hope these resources help you build a workflow that works well for you.
Yeah, OCR is the next best thing. Doesn't Office have a native OCR tool? Just make sure your scan is high-res or your text will be gibberish (e.g. your B's will be 8's).
Simon
OneNote has OCR:
http://www.thewindowsclub.com/onenote-extract-text-from-image
Thanks for all the tips!
Brad
This utility strips leading white space, which becomes a problem when my first column is empty. Is there a configuration option to adjust this?
This has been a frequent, ongoing issue for me. The reason I typically go to Alteryx for BI needs is to build completely autonomous solutions, and the problem I have faced with the solutions found in most forums and blogs is that they don't work well within Alteryx without requiring some form of human interaction.
This is why I came up with a fairly simple and easily automated solution using the R tool and a relatively new R package called pdftools. The code is straightforward, and the beauty is in its simplicity. For automation purposes on Server, the first thing you will want to do is set the working directory; the line of code is easy:
setwd("UNC FilePath")
Then add lines of code to install the packages (don't worry, so far I haven't come across issues with duplicate installs):
install.packages("Rcpp", dependencies = TRUE, repos = "http://cran.us.r-project.org")
install.packages("pdftools", dependencies = TRUE, repos = "http://cran.us.r-project.org")
Note: The Rcpp package is a dependency and is not strictly necessary, but I install it explicitly to prevent issues that occur with other R GUIs.
Now define your data input (the file path to your PDF, found using the Directory tool):
data <- read.Alteryx("#1", mode="data.frame")
Finally change the format of your data:
write.Alteryx(pdftools::pdf_text(file.path(data$FullPath)), 1)
Numbering the pieces left to right: (1) write.Alteryx( (2) pdftools:: (3) pdf_text( (4) file.path( (5) data (6) $FullPath (7) , 1)
Breakdown of the code:
1 & 7 = Alteryx specific R code that defines the output
2 = calls the package we will be using
3 = the command that will convert the pdf to text
4 = used to reformat the cell in our data frame as a file path
5 = the data frame we defined earlier
6 = the field name of the cell from the directory tool
There it is: a very simple solution that allows us to convert a PDF to a usable format within Alteryx.
Good morning, everyone,
I am a new Alteryx user and have been having some issues processing the workflow attached by the author. After extracting the DocToText tar file, I ran doctotext.exe from the command shell, which executed successfully. I then downloaded the workflow, entered the path of the folder where the PDFs are stored under the directory for input data, and entered the name of the specific PDF I need to parse under the file specification. After running it, I still see an error at the Run Command tool. Am I missing a step or process? I would really appreciate some direction. Thank you!
I am also erroring out at the same stage that MD2050 specified above. Where can we get that Runbat.bat file?
In my case I'm generating that .bat file programmatically, since it's running on several hundred PDF files at once. Here's a sample of the output, though. It might just be an issue with the formatting of your request:
".\exe\doctotext.exe" --pdf ".\input\XXXXX Invoices 56040-57209 0717 123.pdf" >>"XXXXX Invoices 56040-57209 0717 123.pdf.txt"
".\exe\doctotext.exe" --pdf ".\input\XXXXX Invoices 56040-57209 0717 124.pdf" >>"XXXXX Invoices 56040-57209 0717 124.pdf.txt"
".\exe\doctotext.exe" --pdf ".\input\XXXXX Invoices 56040-57209 0717 125.pdf" >>"XXXXX Invoices 56040-57209 0717 125.pdf.txt"
Thanks, Daniel, for pointing that out. After seeing your post I took another look at the transformations within the workflow and could then see the .bat content that gets populated in the file. I manually ran the .bat file and got the required output.
I am still not sure how to fix the error in Alteryx, but I will figure it out.
Hi All,
How would I adjust this for .docm?
Thanks
Hi All,
Thanks for sharing the workflow; it is indeed great.
However, I am not able to run the entire workflow because I am getting the error below.
I guess we are missing the .exe file, which is causing this issue.
It would be great if anyone could share the required .exe or another solution.
Hoping for a helping hand.
Thanks,
Jasmeen
I'm getting this issue while processing some PDFs:
Using PDF parser.
Error parsing file. Backtrace:
Cannot decode stream: number of filters does not match the number of decoding parameters
Error while loading stream data at offset 271 and size 1623
Error while parsing page number: 0
It is possible that wrong parser was selected. Trying different parsers.
Trying to detect document format by its content.
Creating parser failed.
Cannot unzip file.
Cannot unzip file.
Using PDF parser.
Error parsing file. Backtrace:
Cannot decode stream: number of filters does not match the number of decoding parameters
Error while loading stream data at offset 271 and size 1623
Error while parsing page number: 0
Error processing file Wo_Revised_Summary.pdf.
Hey there. Just wanted to let brave souls know that I created a simple Alteryx tool using the Python SDK to parse PDF data. If you're interested in taking it for a spin, have a look:
https://community.alteryx.com/t5/Alteryx-Server-Discussions/Alteryx-PDF-to-Text-Tool-Beta/m-p/301296...
Thanks all.
This is really nice. I have saved many hours using the attached workflow.
Thanks for sharing.
Chad, thanks for sharing this solution! Like some of the other users, I too am getting an error code 1 at the Run Command tool. I'm not sure what you mean by running it in the command prompt. Thanks much!!!
Jacob
@Daniel_MMI
How exactly did you add the 'Run Command' batch?
Can you please let me know?
See the attached images.
Workflow: We take the first record from the incoming data. Nothing from this record will be used; it's just a vessel for the Formula tool that follows. That Formula tool holds the text I need to send to the command line, in this case deleting the temp PDF and TXT files. The Run Command tool took some figuring out, but what it does is: write the text from that Formula tool into 'clear_when_finished.bat' (the top part of the configuration screen), then execute that .bat file using the 'Run External Program' option in the second section. This method will work for anything that can be done from the command line, using those two steps: 1) write the command to a .bat file, and 2) execute that .bat file as an 'external program'.
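The two-step pattern Daniel describes (write the command text to a script file, then execute that file as an external program) can be sketched outside Alteryx like this. The helper below is purely illustrative, not part of the workflow; it uses cmd on Windows and sh elsewhere.

```python
import os
import subprocess
import tempfile

def write_and_run(command_text: str) -> int:
    """Step 1: write command_text to a temp script file.
    Step 2: execute that file as an external program.
    Returns the script's exit code."""
    suffix = ".bat" if os.name == "nt" else ".sh"
    fd, path = tempfile.mkstemp(suffix=suffix, text=True)
    with os.fdopen(fd, "w") as f:
        f.write(command_text + "\n")
    try:
        if os.name == "nt":
            result = subprocess.run(["cmd", "/c", path])
        else:
            result = subprocess.run(["sh", path])
        return result.returncode
    finally:
        os.remove(path)
```

In the Run Command tool, these two steps correspond to the "Write Source" section (writing the .bat) and the "Run External Program" section (executing it).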
@Daniel_MMI
This was immensely helpful. Thanks a lot!
@Daniel_MMI I'm wondering how you parsed and structured the necessary data from the text for the invoices you mentioned. I was able to pull the text from PDF invoices and find PO numbers, etc.; however, I'm struggling to structure the less identifiable data such as vendor name. Would you mind sharing how you successfully pulled out the required information, such as vendor name, invoice date, and invoice number?
@AlteryxUserFL I figured I'd chime in and mention regular expressions (accessible in the Formula tool, as well as with click-and-build menus in the RegEx tool). If you are familiar with them, great, and you may have already tried this with no luck. If not, I find that any time I have unstructured data, this is often the place to start (if not simpler parsing with the Text to Columns tool first).
My pro tip for the less obvious/consistent patterns: identify the clearest patterns first, and then what falls 'in between' can have more flexible/vague logic, like the values for vendor name. I like the site regexr.com for practicing building patterns and testing my data. For instance, you could build out a pattern to recognize vendor name, invoice date, and invoice number like this: ((?:\w\s*)+)(\d\d\d\d-\d\d-\d\d)(\d{7}) which might represent some word(s) possibly followed by spaces, then a 4-digit year, 2-digit month, and 2-digit day, then 7 digits. Your patterns would obviously be different.
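For anyone who wants to try that example pattern outside Alteryx first, here is a quick Python sketch using the pattern exactly as given above. The sample invoice line in the usage is fabricated; note that, as written, the pattern expects the 7-digit invoice number to follow the date immediately with no separator, so real data would likely need a tweaked pattern.

```python
import re

# The example pattern from the post: vendor words, then a YYYY-MM-DD date,
# then a 7-digit invoice number directly after the date.
PATTERN = re.compile(r'((?:\w\s*)+)(\d\d\d\d-\d\d-\d\d)(\d{7})')

def parse_invoice_line(line: str):
    """Return (vendor, invoice_date, invoice_number) or None if no match."""
    m = PATTERN.search(line)
    if not m:
        return None
    # Group 1 can capture trailing whitespace, so strip the vendor name.
    return m.group(1).strip(), m.group(2), m.group(3)
```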
As someone who is *not* intimately conversant with regular expressions, I used a more heuristic approach, based on recognizing patterns within the resulting data. The first time through, for example, I noticed that Vendor Name was always on line 7. Maybe vendor phone was on line 9, and required a formula to clean up a text string like: "Invoice payable 555-123-4567"
There ended up being several possible configurations, usually based on the # of lines in each invoice description, so over the course of a couple runs, I was able to iron out all the strange exceptions. Below is an example of the sorts of transformations and selections I used:
Keep in mind, this was one of the first complex workflows I built, so there's a good bit of 'brute force' methodology involved. Fortunately, the scope of the incoming data was fairly modest (if not consistent), so once I had it working, it didn't require much maintenance, and any errors stood out pretty clearly.
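Daniel's line-position heuristic can be sketched as follows. The specific positions (vendor on line 7, a phone buried in line 9) and the cleanup rule come from his description and are illustrative only; real invoices would need their own positions and patterns.

```python
import re

def parse_invoice_text(text: str) -> dict:
    """Pick fields out of extracted invoice text by line position,
    then clean them up with small targeted patterns."""
    lines = text.splitlines()
    # Heuristic: vendor name always appears on line 7 (index 6).
    vendor = lines[6].strip() if len(lines) > 6 else ""
    phone = ""
    if len(lines) > 8:
        # Pull just the phone number out of a line of surrounding text,
        # e.g. "Invoice payable 555-123-4567".
        m = re.search(r'\d{3}-\d{3}-\d{4}', lines[8])
        if m:
            phone = m.group(0)
    return {"vendor": vendor, "phone": phone}
```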
@daniel_mmi nice! looks effective and to your point, low maintenance.
@CailinS Thanks! It ran biweekly for almost a year, and by my count saved my accounting department something like 250 hours of especially dull manual labor. It's always good to have accounting on your side!
Thank you for the information. My data would vary a lot, as the invoices come from thousands of vendors. Is there any way to do a reverse regular expression (kind of like machine learning), where you input the data and the desired result for several hundred lines, and it returns the regular expression needed to pull that data?
@AlteryxUserFL If you had enough input data with desired result, you might be able to use machine learning to generate the desired outcome, negating the need to reverse engineer a regular expression.
@NeilR does Alteryx have any machine learning capabilities, or would I need a different platform?
Hello NeilR, making this a community project would be great; however, I don't think I would be able to get authorization to upload thousands of our invoices, or invoice data, to the web. I have applied for the beta program, as those new tools look like they would help me solve this problem. Basically, I plan to convert PDF invoices to text and then build new columns with the fields I want to capture from each PDF text cell. Using those new tools, I should be able to run the sample data through, and it looks like they can help build a formula for a process like this.