Alteryx Designer Knowledge Base

Can Alteryx Parse A Word Doc Or PDF?

Alteryx Alumni (Retired)

One of the biggest reasons people love Alteryx is its ability to read a very large number of different data sources. One limitation is that it cannot read a PDF or Word document without a little help from another program. Why would someone want to do this? One excellent example is parsing a folder full of resumes to search for specific text.

Why can't Alteryx read them natively? These file types are not standard data formats, so to read them we must first convert them to plain text. A free, open-source program called DocToText handles the conversion. It runs at the command line and converts these file types to plain text, which Alteryx can read with no issue.

I've attached an example to this post. The workflow uses an often-underused tool, the Run Command tool. With its help, we can read in a list of files from a specific source folder, parse the info into something DocToText can use, then use the Run Command tool to convert all the files to plain text for further consumption. I've included everything you will need in the attachment (including a folder structure that works well with the module).
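As a rough illustration of the "parse the info into something DocToText can use" step, here is a minimal Python sketch. It is not the attached workflow. It only builds one command line per file and does not run anything. The executable name `doctotext.exe`, the invocation form (file path as the only argument, text redirected from standard output), and the output naming (`report.pdf` becomes `report.pdf.txt`) are all assumptions to check against your DocToText install.

```python
from pathlib import Path

# File types to convert; adjust as needed (assumption for this sketch).
SUPPORTED = {".pdf", ".doc", ".docx"}


def build_conversion_commands(source_dir, exe="doctotext.exe"):
    """Return (command, output_path) pairs, one per supported file in source_dir.

    The command strings are only built here, not executed. `exe` is a
    hypothetical path to the DocToText executable.
    """
    jobs = []
    for path in sorted(Path(source_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            # report.pdf -> report.pdf.txt, written next to the source file
            out_path = path.with_name(path.name + ".txt")
            # Assumed invocation: text goes to stdout and is redirected to a file
            command = f'"{exe}" "{path}" > "{out_path}"'
            jobs.append((command, out_path))
    return jobs
```

Each command string could then be written to a batch file, one line per document, which is the kind of input the Run Command tool can execute.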

Download and extract the attached .yxzp file, check out the module, and let us know what you think! This example has been updated for version 10.0. You'll notice the package will produce a couple of dependency errors when you extract it. That's ok, it won't error on run.

9 - Comet

This sounds like a pretty complicated concept, if you don't have any advance knowledge of the format for each invoice (much less how they'll look as extracted text). Most methods I could find involved some sort of template identification, each with associated parameters that would tell the software how to parse the data. You're also going to be looking at structured (if you're fortunate), unstructured and semi-structured data. Each of those has a different approach, and raises more questions: are you capturing one row per invoice or one per line item? How will you connect the line item data with data shared across the invoice or vendor? 


If you have some kind of CASS address parsing/identification software, you could take what you infer to be the address block, run each pair of lines through the parser, identify the postal address lines, then take the line(s) above those as the vendor name or contact. In the case of my initial line-number example, you could have a translation table that includes a column associated with the vendor. It's not in my screenshot, but I also have an offshoot that exports records that don't pass muster. As long as there are few of those each time I nudge the field ID scheme, it's a finite amount of time before I have them all sorted. Obviously that isn't really scalable, but it gets into that near-enterprise range; if I have to manually intervene a few hundred times, on one occasion, it's worth it for the time saved. That approach doesn't work so well with hundreds/thousands of new vendors per billing period. 
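To make the translation-table idea concrete, here is a minimal Python sketch, assuming each vendor's invoices put a given field on a fixed line of the extracted text. The vendor names, field names, and line numbers are all hypothetical. Records that can't be resolved are flagged so they can be exported for manual review, like the offshoot described above.

```python
# Hypothetical translation table: vendor -> {field name: zero-based line number}
FIELD_LINES = {
    "Acme Supply": {"invoice_number": 2, "invoice_date": 3},
    "Globex": {"invoice_number": 0, "invoice_date": 5},
}


def extract_header_fields(vendor, text, table=FIELD_LINES):
    """Return (fields, ok); ok is False when the record needs manual review."""
    # Keep blank lines so the line numbers in the table stay stable.
    lines = [ln.strip() for ln in text.splitlines()]
    mapping = table.get(vendor)
    if mapping is None:
        return {}, False  # unknown vendor: no field ID scheme yet
    fields = {}
    for name, line_no in mapping.items():
        if line_no >= len(lines) or not lines[line_no]:
            return fields, False  # expected line is missing or empty
        fields[name] = lines[line_no]
    return fields, True
```

Adding a vendor then means adding one entry to the table, which is manageable for a finite vendor list but not for hundreds of new vendors per period.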


One place Alteryx shines, in my opinion, is in the context of the mostly-automated workflow. Sort of like dealing with real-time data. Using recent cached data saves a lot of overhead, and is often just fine. The same can be said of automated processes. If you can automate even 80% of a process that was not initially possible, it's a huge benefit in fairly short order, and you can work on reaching 90% or 95% once your (sometimes clunky) process is up and running, and you build more familiarity with the data. 


There are a few Python text-parsing tools, and it's possible one or more could be useful if adapted to the Alteryx R module (fuzzy address mapping, maybe?), but most will put you back to the question of whether you should treat the data as coming from a finite (but expandable) set of vendors or an indeterminate set of vendors.

11 - Bolide

The text is coming in as one row (string) per invoice in Alteryx. At this point, I'm not trying to identify line-level spend information from the invoice, just header-level information such as supplier name, invoice number, and date. We have a semi-automated invoice process that we pay for; however, there have been reports of lost invoices. My goal is to scan all the PDF invoices in our email box and determine how many unique invoices (based on invoice number) we receive per month, per vendor, to check for missing ones. We get ~500 invoices a day from various vendors. I have a large number of invoices that have been saved with the vendor name (exactly as it appears in the PDF) as the file name. If my beta access request is approved, I'm going to run that data through the new Alteryx Machine Learning tools.
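The counting step could look something like the Python sketch below, assuming each record is a (vendor, received date, extracted text) triple and that the invoice number follows a label such as "Invoice No:" or "Invoice #". The label pattern and the date format are assumptions, and real invoices will likely need a pattern per vendor.

```python
import re
from collections import defaultdict

# Hypothetical label pattern; matches e.g. "Invoice No: INV-1001" or "Invoice #A77".
INVOICE_RE = re.compile(
    r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
)


def count_unique_invoices(records):
    """records: iterable of (vendor, 'YYYY-MM-DD', text) tuples.

    Returns ({(vendor, 'YYYY-MM'): unique invoice count}, unparsed records).
    """
    seen = defaultdict(set)
    unparsed = []
    for vendor, received, text in records:
        match = INVOICE_RE.search(text)
        if match is None:
            # No invoice number found: keep for manual review
            unparsed.append((vendor, received))
            continue
        month = received[:7]  # 'YYYY-MM'
        seen[(vendor, month)].add(match.group(1).upper())
    counts = {key: len(numbers) for key, numbers in seen.items()}
    return counts, unparsed
```

Comparing these counts against the counts from the paid invoice process, per vendor and month, would show where invoices are going missing.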

7 - Meteor

Unfortunately, I am still seeing the error below when running the macro through Alteryx. The only change made to the macro was a change in folder. The error appears for all pdf files that I present to it, including some that are not scans of hard copies and even a test save-to-pdf from Excel (2nd image).





When running the macro directly in the Command Prompt, the error above disappears; however, the resulting .pdf.txt files are all empty (0 KB in the sample below).




The only file that has successfully scanned and loaded content to the output .pdf.txt file is the sample resume provided in the download above. Would this issue of empty files be tied to the type of target pdf file?


6 - Meteoroid

Hello All,


I am also receiving the Error Code:1 when switching the folder directory. Does anybody know what we need to reconfigure to fix the workflow? I tried copying the container folder contents to my folder but I am still receiving this error. I tried to manually run the bat file but it just comes back with a blank notes file. Any help would be much appreciated. 

9 - Comet

It's possible you have to downsave the PDF(s) before running. The last time I was using this workflow regularly, I'd have to save it as a lower PDF version, optimized for compatibility, then take *that* PDF and export each of the separate pages. Unfortunately, that bit of software predates the last few PDF version updates, so there are backwards-compatibility issues.

6 - Meteoroid

Is the DocToText software the best option? I receive an error when unzipping the file.

Any other converters that can be used offline?


**Winzip was able to extract only a part of the file --- because it contains invalid or missing data. Would you like to open the partial file? ...CORRUPT.doctotext-4.0.1512-win64.tar **

Alteryx Partner

Is it normal that this workflow generates a TXT file with data on "(C:)" but an empty TXT file on "Google Drive File Stream (G:)"?

Is there any way to solve this problem?

8 - Asteroid

Is this still available? When I download it, it gets to 61% complete and then hangs (unless it's my corporate network blocking it for some reason).

6 - Meteoroid


I have a multi-page PDF that consists of a table. The headers of the table are repeated on every page, and each page has about 10-12 rows depending on the size of the row (text wrapping in the source, apparently). Unfortunately, we have to convert this back to Excel. Each page also has the company branding at the top, which includes some text. Is there any PDF parser that can handle this? I am new to Alteryx and saw that this discussion has been going on for several years.
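One possible cleanup step once the pages have been extracted to text, sketched in Python: treat any line that appears on every page (branding, table headers) as repeated and keep only its first occurrence. This assumes you already have one text string per page. It does not rejoin wrapped rows, and a data row that happens to be identical on every page would also be collapsed.

```python
from collections import Counter


def drop_repeated_lines(pages):
    """pages: list of page texts. Lines present on every page are kept once."""
    page_lines = [
        [ln.strip() for ln in page.splitlines() if ln.strip()] for page in pages
    ]
    # Count on how many pages each distinct line appears.
    page_counts = Counter(line for lines in page_lines for line in set(lines))
    repeated = {
        line
        for line, n in page_counts.items()
        if len(pages) > 1 and n == len(pages)
    }
    rows, emitted = [], set()
    for lines in page_lines:
        for line in lines:
            if line in repeated:
                if line in emitted:
                    continue  # header/branding already kept from an earlier page
                emitted.add(line)
            rows.append(line)
    return rows
```

The remaining rows can then be split into columns before writing them out to Excel.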

8 - Asteroid


Is there any workaround to convert PDF to Excel, or to write the output to a yxdb?

6 - Meteoroid



I followed these instructions and noticed that the error below occurs (please see image). Is there any way to get past this error when converting PDF to Excel?


install.packages("Rcpp",  dependencies = TRUE, repos = "")

install.packages("pdftools",  dependencies = TRUE, repos = "")


Note: Rcpp is a dependency, so installing it separately is not strictly necessary, but I do it to prevent issues that occur with other R GUIs.


Now define your data input (the file path to your PDF, found using the Directory tool):


data <- read.Alteryx("#1", mode="data.frame")


Finally change the format of your data:

1             2         3        4         5   $   6        7
write.Alteryx(pdftools::pdf_text(file.path(data$FullPath)), 1)


Breakdown of the code:

1 & 7 = Alteryx specific R code that defines the output

2 = calls the package we will be using

3 = the command that will convert the pdf to text

4 = used to reformat the cell in our data frame as a file path

5 = the data frame we defined earlier

$ = the operator that selects a field (column) from the data frame

6 = the field name of the cell from the directory tool


There it is: a very simple solution that converts a PDF to a usable format within Alteryx.




Hi @organicchocolate, the error is indicating that you have no data coming into connection (#1). In the screenshot, there is no input connection named #1 connected to the R Tool.


Generally, you will have a much better chance of a response if you post a new question in the Designer forum rather than replying to a blog post. The only people who will see this are previous respondents, and only if they have notifications turned on for it.

5 - Atom

Fantastic solution!

Exactly what I was looking for 😁