Alteryx Designer Desktop Discussions

Understanding the Databricks Connection in Alteryx

kara_mills
7 - Meteor

I am a longtime Alteryx user and am now doing some work in Databricks, as a good portion of our data resides in our data lake. That seems to be the best place to access the data without dealing with egress costs or having to push and pull data back and forth. However, while I love learning new tools, for the sake of time I would prefer to use Alteryx. Before I connect Alteryx to our Azure Databricks environment, I was hoping to hear from others about their experiences, especially if your Alteryx license is local while the data you connect to is in a data lake.

 

  • When you connect to the data in Databricks, do you experience an egress cost or is it "running" in Databricks but using the actions you designate in Alteryx?
  • Are there any limitations to the size of the data that can be used? 
  • Any other unexpected challenges to be aware of before connecting?

I don't want to be the person to connect, love it, but then have to disconnect because I'm pulling too much data and it's costing a fortune. 😉

DanM
Alteryx Community Team

@kara_mills 

 

When using Databricks, you have a couple of options when it comes to egress costs:

 

The first option is to use a normal Input tool to connect to the database using a 64-bit driver. With this method, when you connect and build your workflow you are actually pulling the data out of the database and into Alteryx. You can also cache the data from the Input tool by right-clicking on it and choosing "run and cache". Caching speeds up building, because the data is not pulled again on every run. Just don't forget to remove the caching when you are ready to productionize.

 

The second option is to use the In-Database tools. These tools do all the work in the database without pulling the data into Alteryx, which makes for much faster building. When you do need to pull the data into Alteryx to use tools that aren't supported in the In-DB category, your data has already been reduced to the point where pulling it doesn't cost much.

 

Regarding limitations: the only limit is what your machine can handle. If you are working with millions of rows of data, the In-DB tools will more than likely be a better place to start, since you will hopefully be able to get the data down to a more manageable size before bringing it into Alteryx.
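To make the difference between the two options concrete, here is a minimal sketch of the idea. It is not Alteryx or Databricks code: an in-memory `sqlite3` database stands in for the remote database, and the `sales` table and its columns are made up for illustration. It compares how many rows cross the wire when you pull everything and summarize locally versus summarizing in the database first.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the remote database.
# The "sales" table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
rows = [("East" if i % 2 == 0 else "West", float(i)) for i in range(10_000)]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# Option 1 (standard Input tool style): pull every row, then summarize locally.
pulled = conn.execute("SELECT region, amount FROM sales").fetchall()
local_totals = {}
for region, amount in pulled:
    local_totals[region] = local_totals.get(region, 0.0) + amount

# Option 2 (In-DB style): summarize in the database, pull only the result.
pushed = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall()
pushed_totals = dict(pushed)

print("rows pulled, option 1:", len(pulled))
print("rows pulled, option 2:", len(pushed))
print("same answer:", local_totals == pushed_totals)
```

Both approaches give the same totals, but the second one only transfers one row per region instead of the whole table, which is what matters for egress and for what your machine has to hold.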

 

As for your last question: there can be connection issues in the beginning, but if you set the connection up correctly, everything should run smoothly apart from having to update your database password when it changes.

 

Hope this helps.

DanM

kara_mills
7 - Meteor

Thank you @DanM! I did peek at those tools and I'm super curious about them - I think that might be a huge help in what we need.

 

One question about your comment: "These tools do all the work in the database without pulling it into Alteryx." Are you saying the commands I'm creating in Alteryx are being executed in the database itself, Databricks in this case, so the data isn't pulled into Alteryx at all? I was confused about that piece in the other documentation I was looking at as well. Does the tool funnel my commands to Databricks to execute and then output the results back to Alteryx?

 

Thank you for helping me wrap my head around this concept! 🙂

DanM
Alteryx Community Team

@kara_mills

 

That is correct. What the In-DB tools are doing is building a database statement behind the scenes. All the work is done in the database, and only once you use a Data Stream Out tool are the results brought into Alteryx. The idea is to make this easier for someone who doesn't want to write database statements or doesn't have the knowledge to. I use the In-DB tools all the time. They are easier and faster to use than, for example, writing SQL statements to pull data into Alteryx.
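If it helps to picture it, here is a rough sketch of the concept in Python. This is not how Alteryx is actually implemented: the `InDbStream` class and its methods are hypothetical names, and `sqlite3` stands in for Databricks. Each step only adds to the SQL text, and nothing is executed or transferred until the final "stream out" call.

```python
import sqlite3


class InDbStream:
    """Hypothetical stand-in for a chain of In-DB tools: it only accumulates SQL."""

    def __init__(self, conn, sql):
        self.conn = conn
        self.sql = sql

    def filter(self, condition):
        # Wrap the statement so far in a new query; nothing runs yet.
        return InDbStream(self.conn, f"SELECT * FROM ({self.sql}) WHERE {condition}")

    def summarize(self, group_by, aggregate):
        return InDbStream(
            self.conn,
            f"SELECT {group_by}, {aggregate} FROM ({self.sql}) GROUP BY {group_by}",
        )

    def data_stream_out(self):
        # Only here is the statement sent to the database and the result pulled back.
        return self.conn.execute(self.sql).fetchall()


# Assumption: in-memory SQLite with a made-up "orders" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a", 10.0), ("a", 200.0), ("b", 300.0), ("b", 5.0)],
)

stream = (
    InDbStream(conn, "SELECT * FROM orders")
    .filter("amount > 50")
    .summarize("customer", "SUM(amount) AS total")
)
print(stream.sql)  # the single statement built up by the chain
result = sorted(stream.data_stream_out())
print(result)
```

The point of the sketch is that the filter and the summarize never touch the data on the client side. They just shape the one statement the database ends up running.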
