Note: This is Part 3 of the “Ins and Outs of In-DB” series. Check out Part 1 and Part 2 for more.
Welcome back to the Ins and Outs of In-DB! In this post we’re focusing on how you can use Alteryx in-database (in-DB) tools with Databricks. Let’s dig in.
First off, why use Alteryx with Databricks?
Databricks pioneered the data lakehouse, which combines the data warehouse (structured tables) and the data lake (unstructured files) into one platform – giving you a single home for all your data types. Perfect for analyzing all kinds of data in Alteryx!
Databricks is also a powerful end-to-end platform for machine learning and AI. You can use Alteryx with Databricks to run some really robust, code-first AI/ML workloads. Even if you’re not in charge of building models, you can use Alteryx to collaborate on the data lifecycle with other folks in your org who primarily use Databricks.
You can connect Designer with Databricks using the ODBC connector to start reading from and writing to Databricks. But the fun doesn’t stop there. Of course (because this is the Ins and Outs of In-DB series), Alteryx offers in-database pushdown processing with Databricks.
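Curious what that ODBC handshake looks like outside of Designer? Here’s a minimal Python sketch using pyodbc. It assumes you’ve installed the Databricks ODBC driver, and the host, HTTP path, and token below are placeholders you’d swap for your own workspace values. Designer handles all of this through its connection setup, but it helps to see the moving parts:

```python
# Minimal sketch of a Databricks ODBC connection, assuming the
# Databricks (Simba Spark) ODBC driver is installed. The host,
# HTTP path, and token are placeholders, not real values.
import pyodbc

conn = pyodbc.connect(
    "Driver=Simba Spark ODBC Driver;"
    "Host=your-workspace.cloud.databricks.com;"
    "Port=443;"
    "HTTPPath=/sql/1.0/warehouses/your-warehouse-id;"
    "SSL=1;"
    "ThriftTransport=2;"  # HTTP transport
    "AuthMech=3;"         # username/password auth; UID is the literal string 'token'
    "UID=token;"
    "PWD=your-personal-access-token;",
    autocommit=True,
)

cursor = conn.cursor()
cursor.execute("SELECT current_catalog(), current_schema()")
print(cursor.fetchone())
```

If that runs, you’re talking to Databricks, and the in-DB setup later in this post will be smooth sailing.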
Many Alteryx users who also use Databricks are leveraging in-DB, and it makes sense: if you’re pairing Alteryx with Databricks, you’re probably doing heavy-duty data processing to get your data ready for some serious data science and AI workloads across both platforms.
But to do the fun data science stuff, you need to get your data ready first.
Sounds simple enough, but this is where people hit a lot of snags.
You have to 1) understand the business problem at hand, 2) explore the data you have, 3) get it prepped and formatted for model training, and 4) do this all quickly. Fear not: you can use Designer with Databricks to get it done.
Here’s how Alteryx fits into a data lifecycle with Databricks, with examples of what you can do at each step:
Now that you’ve seen how Alteryx + Databricks work in the grand scheme of things, let’s talk about how in-DB helps.
When you use in-DB, you process the data directly within Databricks. The data doesn’t have to leave the lakehouse. If I lived in a lakehouse, I sure wouldn’t want to leave either. Much like data, getting me out of the lakehouse would be expensive and inefficient. So don’t make your data leave the lakehouse!
In-DB is especially useful for processing large amounts of data and feeding a beast like a machine learning model. When you do your prep and blend in-database, you’re using Databricks’ powerful resources instead of making your device do all the work – speeding up your workflows. So, it makes sense to use in-DB for the data prep stage of the ML lifecycle.
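To make the pushdown idea concrete, here’s a hypothetical before-and-after, reusing the cursor from the connection sketch above (the sales table and its columns are made up for illustration):

```python
# Hypothetical contrast (made-up table and columns), reusing `cursor`
# from the connection sketch above.

# Without pushdown: every raw row travels to your machine before any
# work happens.
# cursor.execute("SELECT * FROM sales")  # millions of rows over the wire

# With pushdown: Databricks does the aggregation, and only the small
# summary result leaves the lakehouse.
cursor.execute("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
print(cursor.fetchall())  # a handful of rows
```

Same answer either way, but the pushdown version leaves the heavy lifting (and the heavy data) in Databricks.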
Set up an in-database connection to Databricks by dropping the In-DB Connect Tool on the canvas and adding Databricks as a data source (make sure you have Databricks access and the ODBC driver installed).
Then get cracking on your business problem – say, segmenting customers to train an ML model that targets them with personalized product assortments based on ZIP code.
You’ll need to segment those customers by ZIP code, filter out the ZIP codes you don’t want, and find the number of customers per ZIP code. Once you’ve done that, you’ll write the results to a table to make them available for use.
You can do this all directly in Databricks with the in-DB tools. No leaving the lakehouse.
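Behind the scenes, the in-DB tools generate SQL and push it down to Databricks. As a rough illustration, here’s the kind of statement that segmentation workflow might boil down to. This is a hypothetical sketch: the customers table, its columns, and the excluded ZIP codes are all assumptions, and it again reuses the cursor from the first sketch:

```python
# Hypothetical pushdown SQL for the ZIP-code segmentation example.
# The table, columns, and excluded ZIP codes are all assumptions.
segment_sql = """
    CREATE OR REPLACE TABLE customer_segments AS
    SELECT
        zip_code,
        COUNT(*) AS customer_count
    FROM customers
    WHERE zip_code NOT IN ('00000', '99999')  -- drop the ZIPs you don't want
    GROUP BY zip_code
"""
cursor.execute(segment_sql)  # runs entirely inside Databricks
```

The result lands as a table inside Databricks, ready for model training, and not a single raw row had to leave the lakehouse.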
Ready to get going? Learn more about Alteryx and Databricks and get a hands-on demo here.
As always, happy pushing!
Alex