Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.
DavidHa
Alteryx


 

Welcome back to another entry in the "will it Alteryx?" series. In this installment, we'll look at Dremio. We'll begin with an introduction to Dremio and then explore how Alteryx can integrate with this technology. So as we enter this holiday season, grab a cup of hot chocolate, curl up by the fire, and let's find out: Dremio, will it Alteryx?

 

 

What is Dremio?

 

Data lake technology has grown rapidly over the last decade as a cost-efficient way to store massive amounts of data. Common examples include HDFS and S3. However, querying and analyzing data in data lakes quickly and easily has been a common problem. Dremio attempts to solve that problem by providing a cloud-based data lake query engine service.

 

Dremio can be deployed in AWS, Azure, or on-premises. Deployments can be on standalone machines, co-located with Hadoop clusters, or even run as containers managed by Kubernetes. Dremio allows users to define clusters of machines that can elastically scale to meet data volume or workload demands.

 

An example Dremio cluster, with a Coordinator and multiple Executors.

 

 

Dremio's goal is to "deliver lightning-fast queries and a self-service semantic layer directly on your cloud data lake storage." Dremio therefore acts simply as the execution engine against the data lake storage; no copies of the data are made. Query execution is built on Apache Arrow, a columnar in-memory data format, which helps queries run extremely fast. Data can be queried from a number of data lake solutions, as well as other external databases. In this way, Dremio can be thought of as a data federation solution. Results can be retrieved from Dremio by analytic and reporting solutions over ODBC, JDBC, REST API, and Arrow Flight.

 

Dremio integrates with various data storage, analytics, and reporting solutions.

 

To determine if Alteryx could integrate with Dremio, I first had to set up a Dremio cluster. This was extremely easy to accomplish using an AWS CloudFormation template that creates the environment for you. Within a matter of minutes, I had a cluster consisting of a Coordinator and four Executors.

 

A list of Dremio machines in the cluster.

 

 

I then reviewed the list of external data sources and data lakes that could be connected, and configured Dremio to use my AWS S3 account.

 

DataLake.png

 

 

With a Dremio cluster at my fingertips, it was time to answer the age-old question: Dremio, will it Alteryx?

 

 

 

Will it Alteryx?

 

Dremio supports two connectivity methods that will work with Alteryx: REST API and ODBC, the latter using Dremio's own ODBC driver. While Alteryx doesn't list Dremio as a fully supported data source, a generic ODBC connection works without issue.
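The rest of this walkthrough follows the ODBC route. For the API route, here is a rough Python sketch of what submitting a query over HTTP might look like. The host name, web port 9047, credentials, and sample query are placeholders, and the endpoint paths and token header format reflect my understanding of Dremio's API, so verify them against the documentation for your Dremio version. The functions only build the requests; nothing is sent.

```python
import json
import urllib.request

# Assumptions (not from this article): host, web port 9047, credentials,
# endpoint paths, and the "_dremio" token prefix. Verify before relying on them.
BASE_URL = "http://dremio.example.com:9047"


def build_login_request(base_url, username, password):
    """Build (but do not send) the login request that returns a token."""
    body = json.dumps({"userName": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/apiv2/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_sql_request(base_url, token, sql):
    """Build (but do not send) a request that submits a SQL query as a job."""
    body = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v3/sql",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"_dremio{token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    login = build_login_request(BASE_URL, "my_user", "my_password")
    print(login.get_method(), login.full_url)
    # Sending would look roughly like:
    #   with urllib.request.urlopen(login) as resp:
    #       token = json.load(resp)["token"]
```

In Designer, calls like these would typically be issued from the Download tool rather than from Python; the sketch is only meant to show the shape of the requests.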

 

First, install and configure the Dremio ODBC driver. Dremio's standard ODBC port is 31010. A Dremio administrator can provide credentials to connect.

 

An example ODBC configuration for Dremio.
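Before moving to Designer, it can be handy to confirm the DSN works from outside Alteryx. The sketch below assumes a DSN named "Dremio" (a hypothetical name, use whatever you entered in the ODBC administrator) and the third-party pyodbc package; the credentials are placeholders.

```python
def build_dsn_connection_string(dsn, user, password):
    """Assemble an ODBC connection string that refers to an existing DSN."""
    # Values containing ';' would need extra escaping; omitted to keep this short.
    return f"DSN={dsn};UID={user};PWD={password}"


def run_query(conn_str, sql):
    """Open a connection, run one query, and return all rows."""
    # Third-party package, imported lazily so the helper above works without it.
    import pyodbc

    # autocommit=True is an assumption: Dremio is commonly reported not to
    # support transactions over ODBC, so verify this for your driver version.
    conn = pyodbc.connect(conn_str, autocommit=True)
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # "Dremio" is a hypothetical DSN name; credentials are placeholders.
    print(build_dsn_connection_string("Dremio", "my_user", "my_password"))
```

With a live cluster, something like `run_query(conn_str, "SELECT 1")` would be a quick smoke test of the driver, port, and credentials.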

 

Next, in Designer, use an Input Tool with a Data Source -> Generic ODBC Connection. Specify the Dremio DSN you previously configured and the required credentials:

 

ODBC2.png

 

Assuming everything works correctly, you should be prompted to choose an existing table. Some sample data sets from Dremio will be shown, as well as any tables that you have configured or been given access to by your administrator. In the example below, some datasets from my AWS S3 account are available. As mentioned previously, Dremio is not actually storing these data assets; it is simply providing an execution engine to query against them.

 

An example listing of tables accessible from Dremio.

 

 

After selecting a table, you can build an Alteryx workflow to analyze and enrich that data. In the example below, I joined a table from Dremio with a local file for some further analysis.

 

A sample Alteryx workflow joining together a local file with data from Dremio.
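For readers who think in code, the join in that workflow is conceptually an inner join between rows pulled from Dremio and rows read from a local file. The Python sketch below illustrates the idea only; the column names and values are made up, and the list of dictionaries stands in for rows that would really arrive over ODBC.

```python
import csv
import io


def inner_join(left_rows, right_rows, key):
    """Join two lists of dict rows on a shared key (inner join)."""
    # Index the right side by key so each left row is matched in one lookup.
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    joined = []
    for left in left_rows:
        for right in index.get(left[key], []):
            joined.append({**left, **right})
    return joined


# Made-up stand-in for rows returned from a Dremio table.
dremio_rows = [
    {"customer_id": "1", "total_sales": "250.00"},
    {"customer_id": "2", "total_sales": "75.50"},
    {"customer_id": "3", "total_sales": "10.00"},
]

# Made-up stand-in for the local file, read as CSV.
local_csv = io.StringIO("customer_id,region\n1,East\n2,West\n")
local_rows = list(csv.DictReader(local_csv))

result = inner_join(dremio_rows, local_rows, "customer_id")
for row in result:
    print(row)
```

Customer 3 has no match in the local file, so it is dropped from the joined output, which is what you would expect to see on the Join tool's J output.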

 

 

Pulling data from Dremio and analyzing it locally with Designer performed quite well. However, Dremio was designed to provide an optimized execution engine for querying data in the cloud using a scalable distributed processing environment. To take advantage of those benefits, we need to use In-Database processing. With In-Database processing, we can push the Alteryx processing logic to execute directly inside the Dremio cluster, taking advantage of the compute power of the Dremio environment.

 

First, we need to configure an In-DB connection. This is easily accomplished by reusing our previous ODBC configuration.

 

InDB.png

 

 

Next, we can build an In-Database workflow using the Connect In-DB Tool and subsequent In-Database Tools. Because the data sets are processed directly in Dremio, these workflows complete in a fraction of the time it takes to pull the data down and process it locally.

 

Workflow2.png
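To make the difference between the two approaches concrete, here is a small Python sketch. It uses an in-memory SQLite database purely as a stand-in for Dremio, with a made-up `trips` table, and compares pulling every row and aggregating locally against pushing the aggregation into the query, which is the idea behind In-Database processing.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a small made-up table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trips (city TEXT, fare REAL)")
    conn.executemany(
        "INSERT INTO trips VALUES (?, ?)",
        [("Denver", 10.0), ("Denver", 20.0),
         ("Boulder", 5.0), ("Boulder", 7.0), ("Boulder", 9.0)],
    )
    return conn


def aggregate_locally(conn):
    """Pull every row to the client, then sum fares per city in Python."""
    rows = conn.execute("SELECT city, fare FROM trips").fetchall()
    totals = {}
    for city, fare in rows:
        totals[city] = totals.get(city, 0.0) + fare
    return totals, len(rows)


def aggregate_in_database(conn):
    """Let the engine do the grouping; only the summary rows come back."""
    rows = conn.execute(
        "SELECT city, SUM(fare) FROM trips GROUP BY city"
    ).fetchall()
    return dict(rows), len(rows)


if __name__ == "__main__":
    conn = make_demo_db()
    print(aggregate_locally(conn))
    print(aggregate_in_database(conn))
```

Both functions return the same totals, but the second one moves only one row per city across the connection. On a real data lake table with millions of rows, that reduction in data movement, plus the cluster's compute, is where the speedup comes from.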

 

One additional note: I was unable to write data from Alteryx back to Dremio. It looks like Dremio has yet to release this capability, which appears to be planned as part of their "iceberg tables feature". If you need the ability to write data back to Dremio, you will want to stay tuned to Dremio release updates for news on that feature.

 

 

Final Thoughts

 

Dremio provides a robust and scalable cloud data lake query engine, capable of querying data assets from various sources. Alteryx can integrate with Dremio using its generic ODBC connection support, as well as REST API connectivity. This provides a unique approach to prepping, blending, and enriching data assets stored in the cloud.

 

If you have other thoughts on how to leverage Dremio with Alteryx, please leave a comment below!

 

David Hare
Manager, Solutions Architecture

David is a manager of the Solutions Architecture team helping customers understand the Alteryx platform, how it integrates with their existing IT infrastructure, and how Alteryx can provide high performance and advanced analytics. He's passionate about learning new technologies and recognizing how they can be leveraged to solve organizations' business problems.
