
Alteryx Designer Desktop Knowledge Base

Definitive answers from Designer Desktop experts.

Can Alteryx Parse A Word Doc Or PDF?

ChadM
Alteryx Alumni (Retired)

One of the biggest reasons people love Alteryx is its ability to read a very large number of different data sources. One limitation is that it cannot read a PDF or Word document without a little help from another program. Why would someone want to do this? One good example is parsing a folder full of resumes to search for specific text.

Why can't Alteryx read them natively? These file types are not standard data formats, so in order to read them, we must first convert them to a plain text file.  To convert, there is a free, open-source program called DocToText. This program can be run at the command line to convert these file types to plain text, which Alteryx can read with no issue.

I've attached an example to this post. The workflow uses an often-overlooked tool, the Run Command tool. With its help, we can read in a list of files from a specific source folder, parse that information into something DocToText can use, then use the Run Command tool to convert all the files to plain text for further processing. The attachment includes everything you will need (including a folder structure that works well with the module).
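The command-building step described above can be sketched outside Alteryx. This is a minimal Python sketch, not the attached workflow. The executable name `doctotext.exe`, the sample paths, the list of extensions, and the assumption that the converter writes plain text to standard output (redirected here into a `.txt` file) are all illustrative; check the DocToText documentation for its actual invocation.

```python
from pathlib import PureWindowsPath

# Extensions to convert. The article only mentions Word and PDF files;
# the exact list here is an assumption.
SUPPORTED = {".pdf", ".doc", ".docx"}


def build_commands(file_paths, converter="doctotext.exe"):
    """Return one command line per supported file.

    Assumption: the converter prints plain text to standard output, so each
    command redirects that output into "<original name>.txt".
    """
    commands = []
    for raw in file_paths:
        path = PureWindowsPath(raw)
        if path.suffix.lower() not in SUPPORTED:
            continue  # skip files the converter is not meant for
        output = str(path) + ".txt"
        # Quote every path so names containing spaces survive command-line parsing.
        commands.append(f'"{converter}" "{path}" > "{output}"')
    return commands


if __name__ == "__main__":
    sample = [r"C:\Resumes\Jane Doe.pdf", r"C:\Resumes\notes.xlsx"]
    for line in build_commands(sample):
        print(line)
```

Each resulting line could be written to a batch file and executed in one pass, which is roughly the role the Run Command tool plays in the workflow.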

Download and extract the attached .yxzp file, check out the module, and let us know what you think! This example has been updated for version 10.0. You'll notice the package produces a couple of dependency errors when you extract it. That's OK; the workflow will still run without error.

Attachments
Comments
daniel_mmi
9 - Comet

This sounds like a pretty complicated concept, if you don't have any advance knowledge of the format for each invoice (much less how they'll look as extracted text). Most methods I could find involved some sort of template identification, each with associated parameters that would tell the software how to parse the data. You're also going to be looking at structured (if you're fortunate), unstructured and semi-structured data. Each of those has a different approach, and raises more questions: are you capturing one row per invoice or one per line item? How will you connect the line item data with data shared across the invoice or vendor? 

 

If you have some kind of CASS address parsing/identification software, you could take what you infer to be the address block, run each pair of lines through the parser, identify the postal address lines, then take the line(s) above those as the vendor name or contact. In the case of my initial line-number example, you could have a translation table that includes a column associated with the vendor. It's not in my screenshot, but I also have an offshoot that exports records that don't pass muster. As long as there are few of those each time I nudge the field ID scheme, it's a finite amount of time before I have them all sorted. Obviously that isn't really scalable, but it gets into that near-enterprise range; if I have to manually intervene a few hundred times, on one occasion, it's worth it for the time saved. That approach doesn't work so well with hundreds/thousands of new vendors per billing period. 

 

One place Alteryx shines, in my opinion, is in the context of the mostly-automated workflow. Sort of like dealing with real-time data. Using recent cached data saves a lot of overhead, and is often just fine. The same can be said of automated processes. If you can automate even 80% of a process that was not initially possible, it's a huge benefit in fairly short order, and you can work on reaching 90% or 95% once your (sometimes clunky) process is up and running, and you build more familiarity with the data. 

 

There are a few Python text-parsing tools, and it's possible one or more could be usefully adapted to the Alteryx R module (fuzzy address mapping, maybe?), but most will put you back to the question of whether you should treat the data as coming from a finite (but expandable) set of vendors or an indeterminate set of vendors. 

AlteryxUserFL
11 - Bolide

The text is coming in as one row (string) per invoice in Alteryx. At this point, I'm not trying to identify line-level spend information from the invoice, just header-level information such as supplier name, invoice number, and date. We have a semi-automated invoice process that we pay for; however, there have been reports of lost invoices. My goal is to scan all the PDF invoices we have in the email box to determine how many unique invoices (based on invoice number) we are getting per month, per vendor, to check for missing invoices. We get ~500 invoices a day from various vendors. I have a large number of invoices that have been saved with the vendor name (exactly as it appears in the PDF) as the file name. If my beta access request is approved, I'm going to run that data through the new Alteryx Machine Learning tools. 
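Once the header fields have been extracted, the per-vendor, per-month count described above could be sketched in Python as follows. The record layout (vendor, invoice number, ISO date) and the sample values are hypothetical, and normalizing invoice numbers by trimming and upper-casing is an assumption about what should count as a duplicate.

```python
from collections import defaultdict


def unique_invoices_per_month(records):
    """Count distinct invoice numbers per (vendor, "YYYY-MM").

    Each record is a (vendor, invoice_number, iso_date) tuple; this layout is
    an assumption made for illustration.
    """
    seen = defaultdict(set)
    for vendor, invoice_number, iso_date in records:
        month = iso_date[:7]  # "2019-03-14" -> "2019-03"
        # Normalize so " inv-001" and "INV-001" count as the same invoice.
        seen[(vendor, month)].add(invoice_number.strip().upper())
    return {key: len(numbers) for key, numbers in seen.items()}


if __name__ == "__main__":
    sample = [
        ("Acme Supply", "INV-001", "2019-03-02"),
        ("Acme Supply", "inv-001 ", "2019-03-05"),  # duplicate copy of INV-001
        ("Acme Supply", "INV-002", "2019-03-20"),
        ("Acme Supply", "INV-003", "2019-04-01"),
    ]
    for key, count in sorted(unique_invoices_per_month(sample).items()):
        print(key, count)
```

Comparing these counts with the counts reported by the paid invoice process would show the months and vendors where invoices may be missing.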

aesanchez
7 - Meteor

Unfortunately, I am still seeing the error below when running the macro through Alteryx. The only change made to the macro was a change in folder. The error appears for all pdf files that I present to it, including some that are not scans of hard copies and even a test save-to-pdf from Excel (2nd image).

 

Error.PNG

 

 

When running the macro directly in the Command Prompt, the error above disappears; however, the resulting .pdf.txt files are all empty (0 KB in the sample below). 

 

cutepdfview.PNG

 

The only file that has successfully scanned and loaded content to the output .pdf.txt file is the sample resume provided in the download above. Would this issue of empty files be tied to the type of target pdf file?

 

Jpschnee
6 - Meteoroid

Hello All,

 

I am also receiving the Error Code:1 when switching the folder directory. Does anybody know what we need to reconfigure to fix the workflow? I tried copying the container folder contents to my folder but I am still receiving this error. I tried to manually run the bat file but it just comes back with a blank notes file. Any help would be much appreciated. 

daniel_mmi
9 - Comet

It's possible you have to downsave the PDF(s) before running. The last time I was using this workflow regularly, I'd have to save it as a lower PDF version, optimized for compatibility, then take *that* PDF and export each of the separate pages. Unfortunately, that bit of software predates the last few PDF version updates, so there are backwards-compatibility issues.
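To see which files might need downsaving, one option is to read the version declared in each PDF's header, which begins with `%PDF-` followed by a version number. The following is a minimal Python sketch under that assumption; it reads only the header (a PDF can also declare a newer version elsewhere in the file), and which versions the converter actually handles is not stated here, so any cut-off you apply is your own assumption.

```python
import re

# A PDF file begins with a header such as "%PDF-1.4".
_HEADER = re.compile(rb"%PDF-(\d+\.\d+)")


def pdf_version(first_bytes):
    """Return the version string from a PDF header, or None if none is found."""
    match = _HEADER.search(first_bytes[:1024])
    return match.group(1).decode("ascii") if match else None


def read_pdf_version(path):
    """Read the first kilobyte of a file and report its declared PDF version."""
    with open(path, "rb") as handle:
        return pdf_version(handle.read(1024))


if __name__ == "__main__":
    print(pdf_version(b"%PDF-1.7\n"))  # prints 1.7
```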

BGDAR
6 - Meteoroid

Is the DocToText software the best option? I receive an error when unzipping the file. 

Any other converters that can be used offline?

 

**Winzip was able to extract only a part of the file --- because it contains invalid or missing data. Would you like to open the partial file? ...CORRUPT.doctotext-4.0.1512-win64.tar **

SeAub
5 - Atom

Is it normal that this workflow generates a TXT file with data on the "(C:)" drive and an empty TXT file on the "Google Drive File Stream (G:)" drive?

Is there any way to solve this problem?

craigja
8 - Asteroid

Is this still available? When I download it, it gets to 61% complete and then hangs (unless it's my corporate network blocking it for some reason).

rite2vinoth
6 - Meteoroid

Hi

I have a multi-page PDF that consists of a table. The table headers are repeated on every page, and each page has about 10-12 rows depending on the row height (apparently due to text wrapping in the source). Unfortunately, we have to convert this back to Excel. Each page also has the company branding at the top, which includes some text. Is there any PDF parser that can handle this? I am new to Alteryx and saw that this discussion has been going on for several years. 

rohit782192
11 - Bolide

Hi,

Is there any workaround to convert PDF to Excel or to write to a .yxdb file?

organicchocolate
6 - Meteoroid

Hello,

 

I followed these instructions and noticed that the error below occurs (please see the image). Is there any way to get past this error when converting PDF to Excel?

 

install.packages("Rcpp",  dependencies = TRUE, repos = "http://cran.us.r-project.org")

install.packages("pdftools",  dependencies = TRUE, repos = "http://cran.us.r-project.org")

 

Note: The Rcpp package is a dependency and is not necessary but I use it to prevent issues that occur with other R GUI's.

 

Now define your data input (The FilePath to your pdf found using the directory tool)

 

data <- read.Alteryx("#1", mode="data.frame")

 

Finally change the format of your data:

         1                2            3           4          5  $    6            7

write.Alteryx(pdftools::pdf_text(file.path(data$FullPath)), 1)

 

Breakdown of the code:

1 & 7 = Alteryx specific R code that defines the output

2 = calls the package we will be using

3 = the command that will convert the pdf to text

4 = used to reformat the cell in our data frame as a file path

5 = the data frame we defined earlier

$ = the R operator that extracts a field (column) from the data frame

6 = the field name of the cell from the directory tool

 

There it is: a very simple solution that allows us to convert a PDF to a usable format within Alteryx.

 

Picture1.png

KaneG
Alteryx Alumni (Retired)

Hi @organicchocolate, the error is indicating that you have no data coming into connection (#1). In the screenshot, there is no input connection named #1 connected to the R Tool.

 

Generally you will have a lot more chance of a response if you post a new question in the designer forum for this rather than responding to a blog post. The only people that will see this are previous respondents and that's if they have notifications turned on for it.

Michael_H
5 - Atom

Fantastic solution!

Exactly what I was looking for 😁

mihir_mir_jb
8 - Asteroid

@ChadM, I was going through this thread of yours, where you have created a wonderful workflow. I tried using it, and it gave me the error "The external program "runBat.bat" returned an error code: 1".

 

Could you please help me resolve this issue?

 

Thanks 

Mihir

CailinS
Alteryx

@mihir_mir_jb check out this other thread where users are troubleshooting the same error. It may be the result of an issue in the 'pathing' of the workflow. You'll likely need to make some modifications to your paths (once you confirm the necessary pieces of the process are in the expected folder paths). https://community.alteryx.com/t5/Alteryx-Designer-Discussions/How-to-convert-attached-PDF-to-Excel-a... 
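One way to confirm that the necessary pieces are where the workflow expects them is a quick existence check before running. This is only a sketch: `runBat.bat` is the file named in the error message, while the folder `C:\PDFParse` and the file name `doctotext.exe` are hypothetical placeholders for your own paths.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the subset of paths that do not exist on disk."""
    return [str(p) for p in paths if not Path(p).exists()]


if __name__ == "__main__":
    # Hypothetical layout; replace these with the paths your workflow uses.
    expected = [
        r"C:\PDFParse\runBat.bat",
        r"C:\PDFParse\doctotext.exe",
    ]
    for path in missing_paths(expected):
        print("Missing:", path)
```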

mihir_mir_jb
8 - Asteroid

@CailinS Thank you for replying.

 

Below is the screenshot showing that I have saved all the files in one place. In addition to these files, I also saved the PDF files in the same folder (I have temporarily removed the PDFs). However, I still get the same error. Do you think anything needs to be amended in the Run Command tool? 

 

mihir_mir_jb_0-1612455236302.png

 

 

 

mihir_mir_jb_0-1612455380792.png

 

CailinS
Alteryx

@mihir_mir_jb you may need to review the actual .bat to confirm pathing matches your personal paths. Also, I saw a prior thread stating: 'It's possible you have to downsave the PDF(s) before running. The last time I was using this workflow regularly, I'd have to save it as a lower PDF version, optimized for compatibility, then take *that* PDF and export each of the separate pages. Unfortunately, that bit of software predates the last few PDF version updates, so there are backwards-compatibility issues.'

 

I can't confirm the exact cause of your error, unfortunately.

Idyllic_Data_Geek
8 - Asteroid

Using PDF parser.
Cant open Letter 1.pdf.pdf for reading
It is possible that wrong parser was selected. Trying different parsers.
Trying to detect document format by its content.
Error opening file Letter 1.pdf.pdf.
Error processing file Letter 1.pdf.pdf.

 

 

 

I get this error...any idea why?

Manjari
8 - Asteroid

Hi @mpate and @Chad :

 

I am getting the same error  "Error: Run Command (3): The external program "runBat.bat" returned an error code: 1"

 

Were you able to fix the error? Any help will be greatly appreciated.

rohit782192
11 - Bolide

Which license is needed to read PDF files if we don't download these? It only works for Word documents.

tgawade
5 - Atom

My records are doubling every time I run it. Can anyone show me the entire workflow?