on 08-25-2014 11:22 PM - edited on 03-11-2019 09:36 AM by SydneyF
One of the biggest reasons people love Alteryx is its ability to read a very large number of different data sources. One limitation is that it cannot read a PDF or Word document without a little help from another program. Why would someone want to do this? One excellent example is parsing a folder full of resumes to search for specific text.
Why can't Alteryx read them natively? These file types are not standard data formats, so to read them we must first convert them to plain text. A free, open-source program called DocToText handles the conversion. It can be run from the command line to convert these file types to plain text, which Alteryx can read with no issue.
I've attached an example to this post. The workflow uses an often overlooked tool, the Run Command tool. With its help, we can read in a list of files from a specific source folder, parse the file information into something DocToText can use, then use the Run Command tool to convert all of the files to plain text for further consumption. I've included everything you will need in the attachment (including a folder structure that works well with the module).
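For readers who want to see the same idea outside of Alteryx, here is a rough Python sketch of the steps described above: list the files in a source folder, work out an output path for each, then call the converter once per file. It is not taken from the attached workflow. The executable name doctotext.exe, the folder names, and the assumption that the converter prints the extracted text to standard output when given a file path are all placeholders; check the DocToText documentation for the actual command-line usage before relying on it.

```python
from pathlib import Path
import subprocess

# Hypothetical defaults: adjust to match your own install and folder layout.
CONVERTER = "doctotext.exe"
EXTENSIONS = {".pdf", ".doc", ".docx"}


def plan_conversions(source_dir, output_dir, extensions=EXTENSIONS):
    """Return (source file, target text file) pairs for every matching file."""
    source = Path(source_dir)
    output = Path(output_dir)
    jobs = []
    for path in sorted(source.iterdir()):
        if path.is_file() and path.suffix.lower() in extensions:
            # Keep the original extension in the name so resume.pdf and
            # resume.docx do not overwrite each other.
            jobs.append((path, output / (path.name + ".txt")))
    return jobs


def convert_all(jobs, converter=CONVERTER):
    """Run the converter once per file, saving its standard output as text.

    Assumes the converter accepts a file path as its only argument and
    writes the extracted text to standard output.
    """
    for src, dst in jobs:
        dst.parent.mkdir(parents=True, exist_ok=True)
        with open(dst, "w", encoding="utf-8") as out:
            subprocess.run([converter, str(src)], stdout=out, check=True)
```

Calling convert_all(plan_conversions("Source", "Converted")) would then leave one .txt file per document in the Converted folder, ready to be read like any other text input. Splitting the planning step from the conversion step mirrors the workflow, where the file list is prepared first and the Run Command tool does the actual conversion.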
Download and extract the attached .yxzp file, check out the module, and let us know what you think! This example has been updated for version 10.0. You'll notice the package produces a couple of dependency errors when you extract it. That's OK; the workflow won't error when you run it.