A question recently came across our internal Roundtable discussion group that asked the following:
I am working with a prospect who would like to dynamically read in 40 csv files from Hadoop which have the same data schema.
I have attached an input tool with a list of the file path - folder/subfolder/filename.csv and have used a dynamic input to change the file/table name in the connection string. I have tried several different variations but have not successfully read in the files.
Can someone please tell me how to set this up properly?
Dan Hilton of our Professional Services team provided the answer:
The Dynamic Input tool has some limitations. I generally use a normal Input tool within a batch macro for these scenarios as it's more flexible and bullet-proof than the Dynamic Input.
Dan’s answer is spot on, and highlights one of the beautiful aspects of Alteryx: its amazing flexibility. While we in Product Development strive to make Alteryx as robust and functionally complete as possible, there are always going to be use cases that we either haven’t thought of yet or haven’t gotten around to implementing yet. Quite often, though, it’s fairly easy to fill those gaps by taking advantage of the extensibility built into the product via Macros, Apps, or, for the more technically savvy, the APIs and SDKs.
Given this, it seemed to me that it should be fairly easy to build out a more generic solution based on Dan’s answer that would allow a user to specify an HDFS connection, a folder, and a wildcard search, and have Alteryx read all of the matching files as a single input stream. I decided to take this inspiration and build such a tool from the ground up as a set of Macros. The top-level Macro, HDFSDynamicInput, does exactly what I described above, but the actual work to do all of that is split up into other, lower-level Macros, each of which may be independently useful.
The rest of this post goes through the details of how these Macros were put together, but if you are less interested in those details and really just want to start using the Macros, you can download them here (all of the Macros are in a single package).
Let’s start by looking at the end product. The HDFSDynamicInput Macro allows the user to specify their HDFS server (using WebHDFS or HTTPFS), the HDFS Path and Wildcard Search, and the type of files that are to be read (Alteryx currently only supports reading Avro and CSV files from HDFS).
The HDFS Server field requires both the HDFS Host Name and Port. You can also specify the transport mechanism (http or https) if you so choose.
The HDFS Path and Search field allows you to specify the folder and wildcard search for the files you wish to import. All of the files must be in the same folder, as folder recursion is not supported.
As mentioned above, the supported File Types are limited to Avro and CSV. If you select CSV as your File Type, you will have the same CSV options as you would in the standard Input Tool: you can change the field delimiter, specify whether or not the first row contains field names, set the field length and the line at which to start importing data, specify how quotes and read errors are handled, set the code page for the input data, and specify whether shared write access is allowed.
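For example, a hypothetical configuration might use mycluster:50070 as the HDFS Server, /user/alteryx/logs/alteryx-log-*.csv as the HDFS Path and Search, and CSV as the File Type with the first row containing field names.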
The Macro has two output streams, labeled d and i. The d (for data) output stream contains the data from all of the files that meet the specified search criteria. The first field in the output is HDFS_Source, which contains the name of the file each record came from.
The i (for info) output stream contains the list of all of the filenames that matched the specified search criteria, and includes the HDFS file system information for each file.
The guts of the HDFSDynamicInput tool are fairly straightforward. We have a text input with fields for the HDFS Server and the Folder Path and Wildcard Search that get overridden by the Macro inputs. The Folder Path and Wildcard Search is passed into the SplitPath Macro which splits it into separate fields for the Folder Path and the Wildcard Search. These are then passed into the SearchHDFSBatchWrapper Macro. This wraps the SearchHDFS Macro as a Batch Macro so that its inputs can come from the incoming stream instead of from the Macro user interface. The SearchHDFS Macro returns the list of files that match the Folder Path and Wildcard Search criteria. We then append the HDFS Server information back into each record, append the specified Folder Path to the file names that were returned from the search, and pass that information into the HDFSReadBatch Macro. The HDFSReadBatch Macro is responsible for reading each of the files returned from the SearchHDFS Macro and streaming their contents out. The HDFSReadBatch Macro has the same File Type options as the HDFSDynamicInput Macro, so these values are simply sent down.
The SplitPath Macro uses a RegEx Tool to extract the last part of a path string (e.g., /folder1/folder2/file.ext). The path separator can be specified and defaults to “/”, which is what HDFS expects.
As for the regular expression used to split the path, I have to admit that I’m no regex expert. The expression I used is based on others I found on the web, and I used a web-based regex tester to tweak it until it did what I wanted. There are plenty of regex testers out there, and a quick Google search will help you find them.
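For illustration, here is a minimal Python sketch of one such expression (not necessarily the exact one used in the Macro) that separates the folder from the trailing file or search portion, assuming “/” as the separator:

import re

# Greedily capture everything up to the last "/" as the folder; whatever
# follows it is the file (or wildcard search) portion.
PATH_RE = re.compile(r"^(.*)/([^/]*)$")

match = PATH_RE.match("/folder1/folder2/alteryx-log-*.csv")
if match:
    folder, search = match.groups()
    print(folder)  # /folder1/folder2
    print(search)  # alteryx-log-*.csv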
As I said above, the SearchHDFSBatchWrapper Macro simply wraps the SearchHDFS Macro as a Batch Macro so that its inputs can be specified via the input stream rather than the Macro user interface. Remember that a Batch Macro is a special kind of macro that is designed to be run repeatedly in the context of a workflow. It takes a different type of input called a Control Parameter. For each record coming into the Control Parameter input, the entire macro will be re-configured and run beginning to end (see https://help.alteryx.com/current/index.htm#BatchMacro.htm).
The SearchHDFSBatchWrapper Macro takes three inputs: the HDFS Server, HDFS Path, and Wildcard Search string. These are passed into the SearchHDFS Macro and the results are sent to the Macro Output.
The SearchHDFS Macro takes the HDFS Server, HDFS Path, and Wildcard Search text, and returns the list of all files within the specified path that match the search criteria. It does this by using the Download Tool to call the WebHDFS or HTTPFS LISTSTATUS endpoint on the specified server. In order to do that, we need to build up the URL for that endpoint given the provided HDFS Server and HDFS Path information.
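For reference, here is a rough Python sketch of that URL construction, assuming the standard WebHDFS/HttpFS REST layout (the liststatus_url helper and the mycluster:50070 server are just placeholders, not part of the Macro):

def liststatus_url(server: str, hdfs_path: str, scheme: str = "http") -> str:
    # Standard layout: http://<host>:<port>/webhdfs/v1/<path>?op=LISTSTATUS
    return f"{scheme}://{server}/webhdfs/v1{hdfs_path}?op=LISTSTATUS"

print(liststatus_url("mycluster:50070", "/user/alteryx/logs"))
# http://mycluster:50070/webhdfs/v1/user/alteryx/logs?op=LISTSTATUS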
The returned data looks something like this:
{
  "FileStatuses": {
    "FileStatus": [{
      "accessTime": 1479408817975,
      "blockSize": 134217728,
      "childrenNum": 0,
      "fileId": 1180981,
      "group": "hdfs",
      "length": 9627,
      "modificationTime": 1479408818182,
      "owner": "hdfs",
      "pathSuffix": "alteryx-log-01.csv",
      "permission": "755",
      "replication": 3,
      "storagePolicy": 0,
      "type": "FILE"
    }, {
      "accessTime": 1479393492165,
      "blockSize": 134217728,
      "childrenNum": 0,
      "fileId": 1179108,
      "group": "hdfs",
      "length": 60408,
      "modificationTime": 1479393492727,
      "owner": "hdfs",
      "pathSuffix": "alteryx-log-02.csv",
      "permission": "755",
      "replication": 3,
      "storagePolicy": 0,
      "type": "FILE"
    }]
  }
}
We parse this data using the JSON Parse Tool which turns it into record data that looks like this:
JSON_Name | JSON_ValueString
FileStatuses.FileStatus.0.accessTime | 1479408817975
FileStatuses.FileStatus.0.blockSize | 134217728
FileStatuses.FileStatus.0.childrenNum | 0
FileStatuses.FileStatus.0.fileId | 1180981
FileStatuses.FileStatus.0.group | hdfs
FileStatuses.FileStatus.0.length | 9627
FileStatuses.FileStatus.0.modificationTime | 1479408818182
FileStatuses.FileStatus.0.owner | hdfs
FileStatuses.FileStatus.0.pathSuffix | alteryx-log-01.csv
FileStatuses.FileStatus.0.permission | 755
FileStatuses.FileStatus.0.replication | 3
FileStatuses.FileStatus.0.storagePolicy | 0
FileStatuses.FileStatus.0.type | FILE
FileStatuses.FileStatus.1.accessTime | 1479393492165
FileStatuses.FileStatus.1.blockSize | 134217728
FileStatuses.FileStatus.1.childrenNum | 0
FileStatuses.FileStatus.1.fileId | 1179108
FileStatuses.FileStatus.1.group | hdfs
FileStatuses.FileStatus.1.length | 60408
FileStatuses.FileStatus.1.modificationTime | 1479393492727
FileStatuses.FileStatus.1.owner | hdfs
FileStatuses.FileStatus.1.pathSuffix | alteryx-log-02.csv
FileStatuses.FileStatus.1.permission | 755
FileStatuses.FileStatus.1.replication | 3
FileStatuses.FileStatus.1.storagePolicy | 0
FileStatuses.FileStatus.1.type | FILE
FileStatuses.FileStatus.2.accessTime | 1479408846474
FileStatuses.FileStatus.2.blockSize | 134217728
FileStatuses.FileStatus.2.childrenNum | 0
FileStatuses.FileStatus.2.fileId | 1180985
FileStatuses.FileStatus.2.group | hdfs
FileStatuses.FileStatus.2.length | 3753
FileStatuses.FileStatus.2.modificationTime | 1479408846609
FileStatuses.FileStatus.2.owner | hdfs
…
Using the Text To Columns Tool, we split the JSON_Name field to extract the file index and attribute name. These are then sent to the Cross Tab Tool, which puts the information for each file onto a single row, producing a table like this:
FileIndex | accessTime | blockSize | childrenNum | fileId | group | length | modificationTime | owner | pathSuffix | permission | replication | storagePolicy | type
0 | 1479408817975 | 134217728 | 0 | 1180981 | hdfs | 9627 | 1479408818182 | hdfs | alteryx-log-01.csv | 755 | 3 | 0 | FILE
1 | 1479393492165 | 134217728 | 0 | 1179108 | hdfs | 60408 | 1479393492727 | hdfs | alteryx-log-02.csv | 755 | 3 | 0 | FILE
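Outside Alteryx, the same split-and-pivot step could be sketched in Python along these lines (the pivot_liststatus helper is just an illustration of the reshaping, not part of the Macros):

import json
from collections import defaultdict

def pivot_liststatus(raw_json: str) -> list:
    # Flatten the LISTSTATUS response the way the JSON Parse Tool does.
    statuses = json.loads(raw_json)["FileStatuses"]["FileStatus"]
    flat = {}
    for idx, status in enumerate(statuses):
        for attr, value in status.items():
            flat["FileStatuses.FileStatus.{}.{}".format(idx, attr)] = value

    # Split each name into a file index and attribute (Text To Columns),
    # then pivot so each file ends up on a single row (Cross Tab).
    rows = defaultdict(dict)
    for name, value in flat.items():
        _, _, index, attribute = name.split(".", 3)
        rows[index][attribute] = value
    return [dict(FileIndex=index, **attrs)
            for index, attrs in sorted(rows.items(), key=lambda kv: int(kv[0]))]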
We then filter out any results where the type attribute is not “FILE” and apply the Wildcard Search via the WildcardFilter Macro. The date information for each matching file is converted from a Unix timestamp to DateTime fields (thanks to Alteryx ACE Michael Treadwell, whose community post provided that answer), and the results are sorted by file name.
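For reference, WebHDFS reports accessTime and modificationTime in milliseconds since the Unix epoch; a minimal Python equivalent of that conversion looks like this:

from datetime import datetime, timezone

def hdfs_millis_to_datetime(millis: int) -> datetime:
    # WebHDFS timestamps are milliseconds since the Unix epoch.
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)

print(hdfs_millis_to_datetime(1479408818182))  # 2016-11-17 18:53:38.182000+00:00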
The WildcardFilter Macro filters a list of text values by whether or not they match the specified Wildcard Search criteria. Since Alteryx doesn’t natively support wildcard filtering, the Wildcard Search text is converted into a regular expression, which is then passed into the RegEx Tool to perform the matching; the results are then filtered on whether or not each value matched.
To keep things simple, I only handle the * wildcard (for zero or more characters), but it would probably not be too difficult to support others.
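A minimal Python sketch of that conversion (the wildcard_to_regex helper is illustrative, not the Macro’s exact implementation):

import re

def wildcard_to_regex(pattern: str) -> str:
    # Escape regex metacharacters, then turn the "*" wildcard into ".*"
    # and anchor the expression so the whole name must match.
    return "^" + re.escape(pattern).replace(r"\*", ".*") + "$"

names = ["alteryx-log-01.csv", "alteryx-log-02.csv", "summary.avro"]
rx = re.compile(wildcard_to_regex("alteryx-log-*.csv"))
print([n for n in names if rx.match(n)])  # ['alteryx-log-01.csv', 'alteryx-log-02.csv']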
The HDFSReadBatch Macro takes a list of HDFS Server and HDFS File Path information and reads each file in turn, sending the content downstream. The source file name is appended to each record before it is sent to the Macro Output so that records can be related back to their source. The SplitPath Macro is used again here to extract the file name from the full path string provided as input, and the Macro Output is set to Auto Configure by Name (Wait Until All Iterations Run) so that it will work with files whose layouts aren’t completely identical.
The HDFSReadBatch Macro also supports the same File Type Options as the HDFSDynamicInput Macro, allowing the user to specify Avro or CSV, and the appropriate CSV options.
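Outside Alteryx, the per-file read could be approximated with the WebHDFS OPEN operation; the sketch below (read_hdfs_csv is a hypothetical helper, and it ignores the CSV options for brevity) reads one CSV file and tags each record with its source file name, much like the HDFS_Source field in the Macro output:

import csv
import io
import urllib.request

def read_hdfs_csv(server: str, full_path: str, scheme: str = "http") -> list:
    # WebHDFS OPEN returns the file contents (via a redirect to a DataNode).
    url = "{}://{}/webhdfs/v1{}?op=OPEN".format(scheme, server, full_path)
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    # Tag every record with the name of the file it came from.
    file_name = full_path.rsplit("/", 1)[-1]
    return [dict(HDFS_Source=file_name, **row) for row in csv.DictReader(io.StringIO(text))]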
As with all things, there is always room for improvement, and I will leave that as an opportunity for others to explore. A couple of suggestions that I think would be worth pursuing are adding support for folder recursion and for wildcard characters beyond *.
You can download all of the Macros in this post here (All of the macros are in a single package). I hope you find them useful, but more importantly I hope I have helped demonstrate how flexible Alteryx can be, even when what you want to do isn’t immediately available out of the box. Macros are the easiest way to expand the already vast collection of Alteryx tools, and if you’re not already familiar with how to build them, I encourage you to read up on them in the help documentation or watch one of our introductory videos.
I would like to thank @DanH for his Roundtable response that prompted this post, and also @SteveA who provided critical feedback on the Macros as I was working on them.
Finally, while these Macros are free for you to use, please understand that they are not officially sanctioned by Alteryx nor can they be officially supported. That said, I’m happy to respond to any questions or feedback you may have, time allowing.
I've been with Alteryx since 2010 and have worked primarily on the Gallery and Server products. I've recently joined a new team focused on improving how Alteryx integrates with distributed computing platforms.