A question recently came across our internal Roundtable discussion group that asked the following:
I am working with a prospect who would like to dynamically read in 40 csv files from Hadoop which have the same data schema.
I have attached an input tool with a list of the file path - folder/subfolder/filename.csv and have used a dynamic input to change the file/table name in the connection string. I have tried several different variations but have not successfully read in the files.
Can someone please tell me how to set this up properly?
Dan Hilton of our Professional Services team provided the answer:
The Dynamic Input tool has some limitations. I generally use a normal Input tool within a batch macro for these scenarios as it's more flexible and bullet-proof than the Dynamic Input.
Dan’s answer is spot on, and highlights one of the beautiful aspects of Alteryx: its amazing flexibility. While we in Product Development strive to make Alteryx as robust and functionally complete as possible, there are always going to be use cases that we either haven’t thought of yet or haven’t gotten around to implementing yet. Quite often, though, it’s fairly easy to fill those gaps by taking advantage of the extensibility built into the product via Macros, Apps, or, for the more technically savvy, the APIs and SDKs.
Given this, it seemed to me that it should be fairly easy to build out a more generic solution based on Dan’s answer that would allow a user to specify an HDFS connection, a folder, and a wildcard search, and have Alteryx read all of the matching files as a single input stream. I decided to take this inspiration and build such a tool from the ground up as a set of Macros. The top-level Macro, HDFSDynamicInput, does exactly what I described above, but the actual work to do all of that is split up into other, lower-level Macros, each of which may be independently useful.
The rest of this post goes through the details of how these Macros were put together, but if you are less interested in those details and really just want to start using the Macros, you can download them here (all of the Macros are in a single package).
Let’s start by looking at the end product. The HDFSDynamicInput Macro allows the user to specify their HDFS server (using WebHDFS or HTTPFS), the HDFS Path and Wildcard Search, and the type of files that are to be read (Alteryx currently only supports reading Avro and CSV files from HDFS).
The HDFS Server field requires both the HDFS Host Name and Port. You can also specify the transport mechanism (http or https) if you so choose.
The HDFS Path and Search field allows you to specify the folder and wildcard search for the files you wish to import. All of the files must be in the same folder, as folder recursion is not supported.
As mentioned above, the supported File Types are limited to Avro and CSV. If you select CSV as your File Type, you will have the same CSV options as you would in the standard Input Tool: you can change the field delimiter, specify whether or not the first row contains field names, set the field length and the line at which to start importing data, specify how quotes and read errors are handled, set the code page for the input data, and specify whether shared write access is allowed.
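For example, a hypothetical configuration might use mycluster:50070 as the HDFS Server, /user/alteryx/logs/alteryx-log-*.csv as the HDFS Path and Search, and CSV as the File Type with the first row containing field names.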
The Macro has two output streams, labeled d and i. The d (for data) output stream contains the data from all of the files that meet the specified search criteria. The first field in the output is HDFS_Source, which contains the name of the file each record came from.
The i (for info) output stream contains the list of all of the filenames that matched the specified search criteria, and includes the HDFS file system information for each file.
The guts of the HDFSDynamicInput tool are fairly straightforward. We have a text input with fields for the HDFS Server and the Folder Path and Wildcard Search that get overridden by the Macro inputs. The Folder Path and Wildcard Search is passed into the SplitPath Macro which splits it into separate fields for the Folder Path and the Wildcard Search. These are then passed into the SearchHDFSBatchWrapper Macro. This wraps the SearchHDFS Macro as a Batch Macro so that its inputs can come from the incoming stream instead of from the Macro user interface. The SearchHDFS Macro returns the list of files that match the Folder Path and Wildcard Search criteria. We then append the HDFS Server information back into each record, append the specified Folder Path to the file names that were returned from the search, and pass that information into the HDFSReadBatch Macro. The HDFSReadBatch Macro is responsible for reading each of the files returned from the SearchHDFS Macro and streaming their contents out. The HDFSReadBatch Macro has the same File Type options as the HDFSDynamicInput Macro, so these values are simply sent down.
The SplitPath Macro uses a RegEx Tool to extract the last part of a path string (e.g., /folder1/folder2/file.ext). The path separator can be specified and defaults to “/”, which is what HDFS expects.
As for the regular expression used to split the path, I have to admit that I’m no regex expert. The expression I used is based on others I found on the web, and I used a web-based regex tester to tweak it until it did what I wanted. There are plenty of regex testers out there, and a quick Google search will help you find them.
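For illustration, here is a minimal Python sketch of one such expression (not necessarily the exact one used in the Macro) that separates the folder from the trailing file or search portion, assuming “/” as the separator:

import re

# Greedily capture everything up to the last "/" as the folder; whatever
# follows it is the file (or wildcard search) portion.
PATH_RE = re.compile(r"^(.*)/([^/]*)$")

match = PATH_RE.match("/folder1/folder2/alteryx-log-*.csv")
if match:
    folder, search = match.groups()
    print(folder)  # /folder1/folder2
    print(search)  # alteryx-log-*.csv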
As I said above, the SearchHDFSBatchWrapper Macro simply wraps the SearchHDFS Macro as a Batch Macro so that its inputs can be specified via the input stream rather than the Macro user interface. Remember that a Batch Macro is a special kind of macro that is designed to be run repeatedly in the context of a workflow. It takes a different type of input called a Control Parameter. For each record coming into the Control Parameter input, the entire macro will be re-configured and run beginning to end (see https://help.alteryx.com/current/index.htm#BatchMacro.htm).
The SearchHDFSBatchWrapper Macro takes three inputs: the HDFS Server, HDFS Path, and Wildcard Search string. These are passed into the SearchHDFS Macro and the results are sent to the Macro Output.
The SearchHDFS Macro takes the HDFS Server, HDFS Path, and Wildcard Search text, and returns the list of all files within the specified path that match the search criteria. It does this by using the Download Tool to call the WebHDFS or HTTPFS LISTSTATUS endpoint on the specified server. In order to do that, we need to build up the URL for that endpoint given the provided HDFS Server and HDFS Path information.
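For reference, here is a rough Python sketch of that URL construction, assuming the standard WebHDFS/HttpFS REST layout (the liststatus_url helper and the mycluster:50070 server are just placeholders, not part of the Macro):

def liststatus_url(server: str, hdfs_path: str, scheme: str = "http") -> str:
    # Standard layout: http://<host>:<port>/webhdfs/v1/<path>?op=LISTSTATUS
    return f"{scheme}://{server}/webhdfs/v1{hdfs_path}?op=LISTSTATUS"

print(liststatus_url("mycluster:50070", "/user/alteryx/logs"))
# http://mycluster:50070/webhdfs/v1/user/alteryx/logs?op=LISTSTATUS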
The returned data looks something like this:
{
  "FileStatuses": {
    "FileStatus": [{
      "accessTime": 1479408817975,
      "blockSize": 134217728,
      "childrenNum": 0,
      "fileId": 1180981,
      "group": "hdfs",
      "length": 9627,
      "modificationTime": 1479408818182,
      "owner": "hdfs",
      "pathSuffix": "alteryx-log-01.csv",
      "permission": "755",
      "replication": 3,
      "storagePolicy": 0,
      "type": "FILE"
    }, {
      "accessTime": 1479393492165,
      "blockSize": 134217728,
      "childrenNum": 0,
      "fileId": 1179108,
      "group": "hdfs",
      "length": 60408,
      "modificationTime": 1479393492727,
      "owner": "hdfs",
      "pathSuffix": "alteryx-log-02.csv",
      "permission": "755",
      "replication": 3,
      "storagePolicy": 0,
      "type": "FILE"
    }]
  }
}
We parse this data using the JSON Parse Tool which turns it into record data that looks like this:
JSON_Name | JSON_ValueString
FileStatuses.FileStatus.0.accessTime | 1479408817975
FileStatuses.FileStatus.0.blockSize | 134217728
FileStatuses.FileStatus.0.childrenNum | 0
FileStatuses.FileStatus.0.fileId | 1180981
FileStatuses.FileStatus.0.group | hdfs
FileStatuses.FileStatus.0.length | 9627
FileStatuses.FileStatus.0.modificationTime | 1479408818182
FileStatuses.FileStatus.0.owner | hdfs
FileStatuses.FileStatus.0.pathSuffix | alteryx-log-01.csv
FileStatuses.FileStatus.0.permission | 755
FileStatuses.FileStatus.0.replication | 3
FileStatuses.FileStatus.0.storagePolicy | 0
FileStatuses.FileStatus.0.type | FILE
FileStatuses.FileStatus.1.accessTime | 1479393492165
FileStatuses.FileStatus.1.blockSize | 134217728
FileStatuses.FileStatus.1.childrenNum | 0
FileStatuses.FileStatus.1.fileId | 1179108
FileStatuses.FileStatus.1.group | hdfs
FileStatuses.FileStatus.1.length | 60408
FileStatuses.FileStatus.1.modificationTime | 1479393492727
FileStatuses.FileStatus.1.owner | hdfs
FileStatuses.FileStatus.1.pathSuffix | alteryx-log-02.csv
FileStatuses.FileStatus.1.permission | 755
FileStatuses.FileStatus.1.replication | 3
FileStatuses.FileStatus.1.storagePolicy | 0
FileStatuses.FileStatus.1.type | FILE
FileStatuses.FileStatus.2.accessTime | 1479408846474
FileStatuses.FileStatus.2.blockSize | 134217728
FileStatuses.FileStatus.2.childrenNum | 0
FileStatuses.FileStatus.2.fileId | 1180985
FileStatuses.FileStatus.2.group | hdfs
FileStatuses.FileStatus.2.length | 3753
FileStatuses.FileStatus.2.modificationTime | 1479408846609
FileStatuses.FileStatus.2.owner | hdfs
…
Using the Text To Columns Tool, we split the JSON_Name field to extract the file index and attribute name. These are then sent to the Cross Tab Tool, which puts the information for each file onto a single row, producing a table like this:
FileIndex | accessTime | blockSize | childrenNum | fileId | group | length | modificationTime | owner | pathSuffix | permission | replication | storagePolicy | type
0 | 1479408817975 | 134217728 | 0 | 1180981 | hdfs | 9627 | 1479408818182 | hdfs | alteryx-log-01.csv | 755 | 3 | 0 | FILE
1 | 1479393492165 | 134217728 | 0 | 1179108 | hdfs | 60408 | 1479393492727 | hdfs | alteryx-log-02.csv | 755 | 3 | 0 | FILE
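Outside Alteryx, the same split-and-pivot step could be sketched in Python along these lines (the pivot_liststatus helper is just an illustration of the reshaping, not part of the Macros):

import json
from collections import defaultdict

def pivot_liststatus(raw_json: str) -> list:
    # Flatten the LISTSTATUS response the way the JSON Parse Tool does.
    statuses = json.loads(raw_json)["FileStatuses"]["FileStatus"]
    flat = {}
    for idx, status in enumerate(statuses):
        for attr, value in status.items():
            flat["FileStatuses.FileStatus.{}.{}".format(idx, attr)] = value

    # Split each name into a file index and attribute (Text To Columns),
    # then pivot so each file ends up on a single row (Cross Tab).
    rows = defaultdict(dict)
    for name, value in flat.items():
        _, _, index, attribute = name.split(".", 3)
        rows[index][attribute] = value
    return [dict(FileIndex=index, **attrs)
            for index, attrs in sorted(rows.items(), key=lambda kv: int(kv[0]))]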
We then filter out any results where the type attribute is not “FILE” and apply the Wildcard Search via the WildcardFilter Macro. The date information for each matching file is converted from a Unix timestamp to DateTime fields (thanks to Alteryx ACE Michael Treadwell, whose community post provided that answer), and the results are sorted by file name.
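For reference, WebHDFS reports accessTime and modificationTime in milliseconds since the Unix epoch; a minimal Python equivalent of that conversion looks like this:

from datetime import datetime, timezone

def hdfs_millis_to_datetime(millis: int) -> datetime:
    # WebHDFS timestamps are milliseconds since the Unix epoch.
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)

print(hdfs_millis_to_datetime(1479408818182))  # 2016-11-17 18:53:38.182000+00:00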
The WildcardFilter Macro filters a list of text values by whether or not they match the specified Wildcard Search criteria. Since Alteryx doesn’t natively support wildcard filtering, the Wildcard Search text is converted into a regular expression, which is then passed into the RegEx Tool to perform the matching; the results are then filtered on whether or not each value matched.
To keep things simple, I only handle the * wildcard (for zero or more characters), but it would probably not be too difficult to support others.
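A minimal Python sketch of that conversion (the wildcard_to_regex helper is illustrative, not the Macro’s exact implementation):

import re

def wildcard_to_regex(pattern: str) -> str:
    # Escape regex metacharacters, then turn the "*" wildcard into ".*"
    # and anchor the expression so the whole name must match.
    return "^" + re.escape(pattern).replace(r"\*", ".*") + "$"

names = ["alteryx-log-01.csv", "alteryx-log-02.csv", "summary.avro"]
rx = re.compile(wildcard_to_regex("alteryx-log-*.csv"))
print([n for n in names if rx.match(n)])  # ['alteryx-log-01.csv', 'alteryx-log-02.csv']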
The HDFSReadBatch Macro takes a list of HDFS Server and HDFS File Path information and reads each file in turn, sending the content downstream. The source file name is appended to each record before it is sent to the Macro Output so that records can be related back to their source. The SplitPath Macro is used again here to extract the file name from the full path string provided as input, and the Macro Output is set to Auto Configure by Name (Wait Until All Iterations Run) so that it will work with files whose layouts aren’t completely identical.
The HDFSReadBatch Macro also supports the same File Type Options as the HDFSDynamicInput Macro, allowing the user to specify Avro or CSV, and the appropriate CSV options.
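Outside Alteryx, the per-file read could be approximated with the WebHDFS OPEN operation; the sketch below (read_hdfs_csv is a hypothetical helper, and it ignores the CSV options for brevity) reads one CSV file and tags each record with its source file name, much like the HDFS_Source field in the Macro output:

import csv
import io
import urllib.request

def read_hdfs_csv(server: str, full_path: str, scheme: str = "http") -> list:
    # WebHDFS OPEN returns the file contents (via a redirect to a DataNode).
    url = "{}://{}/webhdfs/v1{}?op=OPEN".format(scheme, server, full_path)
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    # Tag every record with the name of the file it came from.
    file_name = full_path.rsplit("/", 1)[-1]
    return [dict(HDFS_Source=file_name, **row) for row in csv.DictReader(io.StringIO(text))]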
As with all things, there is always room for improvement, and I will leave that as an opportunity for others to explore. A couple of suggestions that I think would be worth pursuing are adding support for folder recursion and for wildcard characters beyond *.
You can download all of the Macros in this post here (All of the macros are in a single package). I hope you find them useful, but more importantly I hope I have helped demonstrate how flexible Alteryx can be, even when what you want to do isn’t immediately available out of the box. Macros are the easiest way to expand the already vast collection of Alteryx tools, and if you’re not already familiar with how to build them, I encourage you to read up on them in the help documentation or watch one of our introductory videos.
I would like to thank @DanH for his Roundtable response that prompted this post, and also @SteveA who provided critical feedback on the Macros as I was working on them.
Finally, while these Macros are free for you to use, please understand that they are not officially sanctioned by Alteryx nor can they be officially supported. That said, I’m happy to respond to any questions or feedback you may have, time allowing.
I've been with Alteryx since 2010 and have worked primarily on the Gallery and Server products. I've recently joined a new team focused on improving how Alteryx integrates with distributed computing platforms.