Welcome back to another entry in the "will it Alteryx?" series. In this installment, we'll look at Dremio. We'll begin with an introduction to Dremio and then explore how Alteryx can integrate with this technology. So as we enter this holiday season, grab a cup of hot chocolate, curl up by the fire, and let's find out, Dremio, will it Alteryx?
Data lake technology has grown rapidly over the last decade as a cost-efficient way to store massive amounts of data; common examples include HDFS and S3. However, quickly and easily querying and analyzing the data in those lakes has been a common problem. Dremio attempts to solve that problem by providing a cloud-based data lake query engine service.
Dremio can be deployed in AWS, Azure, or on-premises. Deployments can run on standalone machines, be co-located with Hadoop clusters, or even run as a container environment managed by Kubernetes. Dremio allows users to define clusters of machines that can elastically scale to meet any data volume or workload demands.
Dremio's goal is to "deliver lightning-fast queries and a self-service semantic layer directly on your cloud data lake storage." In other words, Dremio acts purely as the execution engine against the data lake storage; no copies of the data are made. Query execution is built on Apache Arrow, an in-memory columnar format that allows queries to be compiled and executed extremely fast. Data can be queried from a number of data lake solutions, as well as other external databases, so Dremio can also be thought of as a data federation solution. Results can be retrieved from Dremio by analytic and reporting solutions over ODBC, JDBC, the REST API, and Arrow Flight.
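To make the Arrow Flight option concrete, here is a minimal sketch of querying Dremio from Python with pyarrow. The hostname, credentials, dataset name, and the default Flight port (32010) are assumptions for illustration; substitute the values for your own cluster.

```python
# Minimal sketch: querying Dremio over Arrow Flight with pyarrow.
# Hostname, credentials, dataset name, and port 32010 (assumed default
# Flight port) are placeholders -- adjust for your own cluster.
from pyarrow import flight

client = flight.FlightClient("grpc+tcp://dremio-coordinator:32010")

# Basic authentication returns a bearer token header to attach to later calls.
token = client.authenticate_basic_token("my_user", "my_password")
options = flight.FlightCallOptions(headers=[token])

# Submit a query and stream the results back as an Arrow table.
descriptor = flight.FlightDescriptor.for_command(
    "SELECT * FROM my_space.my_dataset LIMIT 100"
)
info = client.get_flight_info(descriptor, options)
reader = client.do_get(info.endpoints[0].ticket, options)
table = reader.read_all()
print(table.num_rows, "rows fetched")
```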
To determine whether Alteryx could integrate with Dremio, I first had to set up a Dremio cluster. This was extremely easy to accomplish with the AWS CloudFormation template Dremio provides to create your own environment. Within a matter of minutes, I had a cluster consisting of a Coordinator and four Executors.
I then reviewed the list of external data sources and data lakes that could be connected, and configured Dremio to use my AWS S3 account.
With a Dremio cluster at my fingertips, it was time to answer the age-old question: Dremio, will it Alteryx?
Dremio supports two connectivity methods that will work with Alteryx: the REST API and ODBC, using Dremio's provided ODBC driver. While Alteryx doesn't list Dremio as a fully supported data source, a generic ODBC connection works without issue.
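Before diving into the ODBC setup, here is a rough sketch of what the REST API path looks like, for example from a Python tool in a workflow. The endpoint paths and the "_dremio" authorization prefix reflect Dremio's documented REST API but may vary by version; the host, port 9047, credentials, and dataset name are placeholders.

```python
# Rough sketch of submitting a query over Dremio's REST API.
# Host, port 9047, credentials, and dataset name are placeholders;
# verify endpoint paths against your Dremio version's documentation.
import time
import requests

BASE = "http://dremio-coordinator:9047"

# Log in to obtain a session token.
login = requests.post(f"{BASE}/apiv2/login",
                      json={"userName": "my_user", "password": "my_password"})
headers = {"Authorization": f"_dremio{login.json()['token']}"}

# Submit a SQL job and poll until it finishes.
job_id = requests.post(f"{BASE}/api/v3/sql",
                       json={"sql": "SELECT * FROM my_space.my_dataset LIMIT 10"},
                       headers=headers).json()["id"]
while requests.get(f"{BASE}/api/v3/job/{job_id}", headers=headers).json()["jobState"] \
        not in ("COMPLETED", "FAILED", "CANCELED"):
    time.sleep(1)

# Fetch the first page of results.
rows = requests.get(f"{BASE}/api/v3/job/{job_id}/results", headers=headers).json()["rows"]
print(rows[:5])
```

The rest of this walkthrough focuses on the ODBC path, since that is what Designer's Input and In-Database tools use.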
First, install and configure the Dremio ODBC driver. Dremio's standard ODBC port is 31010, and a Dremio administrator can provide credentials to connect.
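If you want to confirm the driver and credentials work before opening Designer, a quick check from Python looks something like the sketch below. The driver name ("Dremio Connector") and connection keywords are assumptions based on a typical Dremio ODBC install; only the port 31010 comes from the setup above.

```python
# Minimal sketch: validating the Dremio ODBC connection outside of Designer.
# Driver name and connection keywords are assumptions based on a typical
# Dremio ODBC install -- adjust them to match your DSN configuration.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={Dremio Connector};"
    "HOST=dremio-coordinator;"
    "PORT=31010;"                 # Dremio's standard ODBC port
    "UID=my_user;PWD=my_password;",
    autocommit=True,
)
cursor = conn.cursor()
cursor.execute('SELECT * FROM "my_space"."my_dataset" LIMIT 5')
for row in cursor.fetchall():
    print(row)
```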
Next, in Designer, use an Input Data Tool with Data Source -> Generic ODBC Connection. Specify the Dremio DSN you previously configured and the required credentials:
Assuming everything works correctly, you should be prompted to choose an existing table. Some sample data sets from Dremio will be shown, as well as any tables that you have configured or been given access to by your administrator. In the example below, some datasets from my AWS S3 account are available. As mentioned previously, Dremio is not actually storing these data assets; it is simply providing an execution engine to query against them.
After selecting a table, you can then build an Alteryx workflow to analyze and enrich that data. In the example below, I joined a table from Dremio with a local file for some further analysis.
Pulling data from Dremio and analyzing it locally with Designer performed quite well. However, Dremio was designed to provide an optimized execution engine for querying data in the cloud using a scalable distributed processing environment. To take advantage of those benefits, we need to use In-Database processing. With In-Database processing, we can push the Alteryx processing logic to execute directly inside the Dremio cluster, taking advantage of the compute power of the Dremio environment.
First, we need to configure an In-DB connection. This is easily accomplished with our previous ODBC configuration.
Next, we can build an In-Database workflow using the Connect In-DB Tool and subsequent In-Database tools. This allows data sets to be processed directly in Dremio, completing in a fraction of the time.
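To illustrate what "pushing the processing into Dremio" means, the sketch below shows roughly the kind of SQL a Connect In-DB, Filter, and Summarize chain would hand off to the cluster, executed here over the ODBC connection configured earlier. The DSN name, table, and columns are hypothetical; the exact SQL Alteryx generates will differ.

```python
# Conceptual sketch of In-Database processing: instead of pulling raw rows
# into Designer, the filter/summarize logic is expressed as SQL and executed
# inside the Dremio cluster. DSN, table, and column names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=my_user;PWD=my_password", autocommit=True)

# Roughly what a Connect In-DB -> Filter -> Summarize chain pushes down:
pushed_down_sql = """
    SELECT region, COUNT(*) AS order_count, SUM(amount) AS total_amount
    FROM "my_space"."orders"
    WHERE order_date >= DATE '2020-01-01'
    GROUP BY region
"""
for row in conn.cursor().execute(pushed_down_sql):
    print(row)
```

Only the small aggregated result travels back to Designer; the heavy lifting stays in the Dremio cluster.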
One additional note: I was unable to write data from Alteryx back to Dremio, as it looks like Dremio has yet to release this capability as part of its Iceberg tables feature. If you need the ability to write data back to Dremio, stay tuned to Dremio release updates for news on that feature.
Dremio provides a robust and scalable cloud data lake query engine, capable of querying data assets from various sources. Alteryx can integrate with Dremio through its generic ODBC connection support, as well as REST API connectivity. This provides a unique approach to prepping, blending, and enriching data assets stored in the cloud.
If you have other thoughts on how to leverage Dremio with Alteryx please leave a comment below!
David has the privilege to lead the Alteryx Solutions Architecture team helping customers understand the Alteryx platform, how it integrates with their existing IT infrastructure and technology stack, and how Alteryx can provide high performance and advanced analytics. He's passionate about learning new technologies and recognizing how they can be leveraged to solve organizations' business problems.