
Alteryx Designer Knowledge Base

Definitive answers from Designer experts.

Can Alteryx Parse A Word Doc Or PDF?

Alteryx Alumni (Retired)

One of the biggest reasons people love Alteryx is that it can read a very large number of different data sources. One limitation is that it cannot read a PDF or Word document without a little help from another program. Why would someone want to do this? One excellent example is parsing a folder full of resumes to search for specific text.

Why can't Alteryx read them natively? These file types are not standard data formats, so to read them we must first convert them to plain text. A free, open-source program called DocToText handles the conversion. It runs at the command line and converts these file types to plain text, which Alteryx can read with no issue.

I've attached an example to this post. The workflow uses an often underused tool, the Run Command tool. With it, we can read in a list of files from a specific source folder, parse the file information into a command DocToText can use, then use the Run Command tool to convert all files to plain text for further consumption. The attachment includes everything you will need (including a folder structure that works well with the module).
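For readers who want to see the list-files, build-command, run-command pattern outside Designer, here is a minimal Python sketch. It is not the attached workflow. The executable name `doctotext.exe`, the assumption that it takes the input file as its only argument and writes the text to standard output, and the helper names `build_jobs` and `convert_all` are all assumptions for illustration; check the DocToText documentation for the actual command-line syntax.

```python
import subprocess
from pathlib import Path

# Assumption: file types to convert; extend as needed.
EXTENSIONS = {".pdf", ".doc", ".docx"}


def build_jobs(source_dir, output_dir, exe="doctotext.exe"):
    """Return (command, output_path) pairs, one per convertible file.

    Assumption: the converter takes the input file as its only argument
    and writes plain text to standard output.
    """
    jobs = []
    for path in sorted(Path(source_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in EXTENSIONS:
            # resume.pdf -> resume.pdf.txt
            out_path = Path(output_dir) / (path.name + ".txt")
            jobs.append(([exe, str(path)], out_path))
    return jobs


def convert_all(source_dir, output_dir, exe="doctotext.exe"):
    """Run the converter on every file; return the outputs it reported as OK."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for command, out_path in build_jobs(source_dir, output_dir, exe):
        # Redirect the converter's standard output into the .txt file.
        with open(out_path, "wb") as out_file:
            result = subprocess.run(command, stdout=out_file)
        if result.returncode == 0:
            written.append(out_path)
    return written
```

Keeping the command-building step separate from the step that runs the commands mirrors the workflow's split between preparing the file list and calling the Run Command tool, and it lets you inspect the commands before anything is executed.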

Download and extract the attached .yxzp file, check out the module, and let us know what you think! This example has been updated for version 10.0. You'll notice the package produces a couple of dependency errors when you extract it. That's OK; the workflow won't error when you run it.

Comments

This sounds like a pretty complicated concept, if you don't have any advance knowledge of the format for each invoice (much less how they'll look as extracted text). Most methods I could find involved some sort of template identification, each with associated parameters that would tell the software how to parse the data. You're also going to be looking at structured (if you're fortunate), unstructured and semi-structured data. Each of those has a different approach, and raises more questions: are you capturing one row per invoice or one per line item? How will you connect the line item data with data shared across the invoice or vendor? 

 

If you have some kind of CASS address parsing/identification software, you could take what you infer to be the address block, run each pair of lines through the parser, identify the postal address lines, then take the line(s) above those as the vendor name or contact. In the case of my initial line-number example, you could have a translation table that includes a column associated with the vendor. It's not in my screenshot, but I also have an offshoot that exports records that don't pass muster. As long as there are few of those each time I nudge the field ID scheme, it's a finite amount of time before I have them all sorted. Obviously that isn't really scalable, but it gets into that near-enterprise range; if I have to manually intervene a few hundred times, on one occasion, it's worth it for the time saved. That approach doesn't work so well with hundreds/thousands of new vendors per billing period. 

 

One place Alteryx shines, in my opinion, is in the context of the mostly-automated workflow. Sort of like dealing with real-time data. Using recent cached data saves a lot of overhead, and is often just fine. The same can be said of automated processes. If you can automate even 80% of a process that was not initially possible, it's a huge benefit in fairly short order, and you can work on reaching 90% or 95% once your (sometimes clunky) process is up and running, and you build more familiarity with the data. 

 

There are a few Python text-parsing tools, and it's possible one or more could be usefully adapted to the Alteryx R module (fuzzy address mapping, maybe?), but most will bring you back to the question of whether you should treat the data as coming from a finite (but expandable) set of vendors or an indeterminate set of vendors.

Asteroid

The text is coming in as one row (string) per invoice in Alteryx. At this point, I'm not trying to identify line-level spend info from the invoice, just header-level information such as supplier name, invoice number, and date. We have a semi-automated invoice process that we pay for; however, there have been reports of lost invoices. My goal is to scan all the PDF invoices we have in the email box to determine how many unique invoices (based on invoice number) we are getting per month, per vendor, to check for missing invoices. We get ~500 invoices a day from various vendors. I have a large number of invoices that have been saved with the vendor name (exactly as it appears in the PDF) as the file name. If my beta access request is approved, I'm going to run that data through the new Alteryx Machine Learning tools.
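Once the header fields are extracted, the per-month, per-vendor count of unique invoice numbers is a small grouping step. The following is a minimal Python sketch of that counting logic only, assuming the vendor, invoice number, and date have already been parsed from each invoice; the function name `unique_invoices_per_month` and the sample records are hypothetical.

```python
from collections import defaultdict
from datetime import date


def unique_invoices_per_month(records):
    """Count distinct invoice numbers per (vendor, year, month).

    records: iterable of (vendor, invoice_number, invoice_date) tuples,
    where invoice_date is a datetime.date.
    """
    seen = defaultdict(set)
    for vendor, invoice_number, invoice_date in records:
        key = (vendor, invoice_date.year, invoice_date.month)
        # A set drops duplicate copies of the same invoice number.
        seen[key].add(invoice_number.strip().upper())
    return {key: len(numbers) for key, numbers in seen.items()}


# Hypothetical sample data: INV-001 arrives twice in November.
sample = [
    ("Acme", "INV-001", date(2019, 11, 4)),
    ("Acme", "inv-001", date(2019, 11, 5)),
    ("Acme", "INV-002", date(2019, 11, 20)),
    ("Globex", "7781", date(2019, 12, 2)),
]
counts = unique_invoices_per_month(sample)
```

Comparing these counts with the totals recorded by the paid invoice process would show the vendors and months where invoices may have gone missing.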

Unfortunately, I am still seeing the error below when running the macro through Alteryx. The only change made to the macro was a change in folder. The error appears for all pdf files that I present to it, including some that are not scans of hard copies and even a test save-to-pdf from Excel (2nd image).

 

Error.PNG

 

 

When running the macro directly in the Command Prompt, the error above disappears; however, the resulting .pdf.txt files are all empty (0 KB in the sample below).

 

cutepdfview.PNG

 

The only file that has successfully scanned and loaded content to the output .pdf.txt file is the sample resume provided in the download above. Would this issue of empty files be tied to the type of target pdf file?
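To narrow down which source PDFs produce empty output, it may help to list the 0-byte result files after a run and compare them against the PDF types that went in. A minimal Python sketch, assuming the outputs follow the `.pdf.txt` naming described above; the helper name `empty_outputs` is hypothetical.

```python
from pathlib import Path


def empty_outputs(output_dir, pattern="*.pdf.txt"):
    """Return the names of converted files that are 0 bytes, sorted."""
    return sorted(
        p.name
        for p in Path(output_dir).glob(pattern)
        if p.is_file() and p.stat().st_size == 0
    )
```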

 

Atom

Hello All,

 

I am also receiving Error Code: 1 when switching the folder directory. Does anybody know what we need to reconfigure to fix the workflow? I tried copying the container folder contents to my folder, but I am still receiving this error. I tried to run the .bat file manually, but it just comes back with a blank text file. Any help would be much appreciated.

It's possible you have to downsave the PDF(s) before running. The last time I was using this workflow regularly, I'd have to save it as a lower PDF version, optimized for compatibility, then take *that* PDF and export each of the separate pages. Unfortunately, that bit of software predates the last few PDF version updates, so there are backwards-compatibility issues.

Meteoroid

Is the DocToText software the best option? I receive an error when unzipping the file.

Any other converters that can be used offline?

 

**Winzip was able to extract only a part of the file --- because it contains invalid or missing data. Would you like to open the partial file? ...CORRUPT.doctotext-4.0.1512-win64.tar **

Alteryx Partner

Is it normal that this workflow generates a TXT file with data on the "(C:)" drive but an empty TXT file on the "Google Drive File Stream (G:)" drive?

Is there any way to solve this problem?

Asteroid

Is this still available? When I try to download it, it gets to 61% complete and then hangs (unless it's my corporate network blocking it for some reason).