Alteryx can work with data in Hadoop in multiple ways, including HDFS, Hive, Impala, and Spark. Many of these offer multiple configuration options. The goal of this article is to present the various options available at a high level, note some performance observations between them, and encourage you to perform similar analyses to understand what works best in your environment.
HDFS is built following the WORM pattern. (Write Once, Read Many). It's not designed or optimized for random writes or selected reads. Instead, HDFS works best when very large data files are written once, then read completely. When writing data to HDFS with Alteryx, you have the option to write in CSV or Avro file formats. CSV has the obvious limitation that all data gets converted to Strings, but it’s easy to work with. Avro comes with a couple of output options, including Enabling Compression, and Supporting Null Values.
We can safely assume that enabling compression will take longer to WRITE a file to HDFS, but how much longer? Is the reduced footprint on disk worth it? What impact does it have on subsequent READ requests?
I timed how long it took to read and write several datasets using both CSV, Avro with Compression, and Avro without Compression. Since my datasets didn’t have null values, I didn’t use the Support Null values setting. I have the total times listed below, as well as the disk footprint size (in MB) of one of the data sets after it was written to HDFS. This particular data set was 9.1MB in size on Windows (.yxdb).
(Note: All tests were repeated multiple times to ensure repeatable & consistent results).
The first observation is that CSV is the worst performer in all three categories. Combine those results with the fact that CSV converts all data types to Strings, and Avro is the preferred file format in HDFS. Looking at the Avro results specifically, we can see that Compression was able to reduce the data set size on disk down to ~40% of the uncompressed size. While it did take 3x as long to complete the WRITE to HDFS, the subsequent READS completed in less than a third of the time. In most situations, the increased WRITE time is well worth the disk footprint reduction and shorter READ times. So the recommendation here is to go with Avro with Compression enabled.
When setting up your HDFS connection, you have the option of choosing either HTTPFS or WebHDFS.
Both are REST endpoints providing READ and WRITE access to data in HDFS. HTTPFS provides this service where one node acts as a gateway, by which all data will be transferred to/from the client. This is recommended for secure environments where the cluster is protected or behind a firewall, and all clients must interact with this HTTPFS ‘gateway’. Conversely, WebHDFS works by allowing clients to interact with all the Data Nodes directly. This is less secure than the HTTPFS approach but should produce faster I/O rates. In my tests, the results came out within 1% of each other. I suspect with larger clusters and data set sizes WebHDFS would be faster.
Most Cloudera Hadoop clusters include both Hive and Impala which allow SQL access to data in the Hive metastore. Hive uses MapReduce & YARN behind the scenes, and is typically used for larger batch processing. Impala uses daemon processes and is better suited to interactive data analysis. Alteryx can leverage either one as long as the appropriate ODBC driver is installed and configured.
To compare them I measured READ and WRITE times with the same 9.1MB data set. Results are below:
While the read times are very similar, writing to Hive is much faster than through Impala. But can we improve on that 13 minutes?
When using the Data Stream In (or Write Data In-DB) tool, you have the option in your In-DB Connection to specify the Write Driver to use. With Hive you can use Hive ODBC, HDFS (Avro), or HDFS (CSV).
From the earlier test results, I’m going to eliminate CSV from consideration. HDFS (Avro) works by first writing the table into HDFS (by default to /tmp, but that can be changed), then loading the table into Hive from HDFS. We already know that Hive ODBC takes 13 minutes. Let’s compare that to using HDFS (Avro) as the Write Driver.
That’s not a misprint. Writing the 9.1 MB table to HDFS, then lifting into Hive took only 3 seconds. Even if you are not using In-Database tools in your workflow, this is the recommended way to WRITE data into Hive. This recommendation has been well documented by our friend @HenrietteH in the How to Write to Hive Faster article.
The final comparison I wanted to evaluate was In-Database performance of using Hive (MapReduce & YARN), Impala (daemon processes), and Spark. Spark uses RDD (Resilient Distributed Datasets) to keep data in memory, reducing I/O, and therefore providing faster analysis than traditional MapReduce jobs. Spark can be configured a couple of different ways, but in my cluster, Spark is configured to use YARN. (Note: YARN and Impala were tuned to use the same amount of system resources).
My workflow was simple, just some selects, a Filter, and some Summarizations:
The results were as follows:
Again, these results were repeatable and consistent. Impala was the clear winner, backing up the advertisement that it’s great for interactive data analysis. It would be interesting to see results if the data size was considerably larger, as most of the Spark time is spent creating new sessions and java processes such as the SparkContext and Executors. Impala shines since its daemons are already running, waiting for work to do.
Hopefully, this has helped you understand the various ways you can work with data in Hadoop and some of the performance considerations around which method you choose. It’s always important to evaluate these options in your environment, with your data, and your workflows as each organization will be different.
David has the privilege to lead the Alteryx Solutions Architecture team helping customers understand the Alteryx platform, how it integrates with their existing IT infrastructure and technology stack, and how Alteryx can provide high performance and advanced analytics. He's passionate about learning new technologies and recognizing how they can be leveraged to solve organizations' business problems.
David has the privilege to lead the Alteryx Solutions Architecture team helping customers understand the Alteryx platform, how it integrates with their existing IT infrastructure and technology stack, and how Alteryx can provide high performance and advanced analytics. He's passionate about learning new technologies and recognizing how they can be leveraged to solve organizations' business problems.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.