Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Apache Spark on Databricks: read parquet files with Alteryx

Asteroid

Is it possible to read Parquet files with Apache Spark on Databricks? I'm storing my files in a container in my Azure storage account and, using those files, have created a table in Azure Databricks. I can see the Parquet files in Alteryx, but I don't know how to read them.

 

I tried (and failed) to read them in using approach 3 from this article: https://community.alteryx.com/t5/Engine-Works-Blog/Parquet-will-it-Alteryx/ba-p/423156.

I also had a look at this article, but didn't find it helpful: https://community.alteryx.com/t5/Alteryx-Designer-Knowledge-Base/Access-Data-in-Parquet-Format/ta-p/...

 

If I can connect Alteryx to Databricks, surely there must be a way to read the data?

 

[screenshot: Luisa_Bez_1-1580158560581.png]

 

Alternatively, I have uploaded two of my files directly to Databricks as CSV. I can read those, but storage space on my cluster is limited, so that's not what I'm going for.

 

[screenshot: Luisa_Bez_2-1580158843648.png]

 

Alteryx

You can't read a Parquet file directly from the In-DB connection wizard, but you can use the Apache Spark Code tool in standalone mode to read in a Parquet file:

 

 

spark.read.parquet('...parquet')
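For context, here's a minimal PySpark sketch of that read, assuming spark is the SparkSession that Databricks provides by default; the DBFS path below is a placeholder, not one from this thread:

# Read a Parquet file from DBFS into a Spark DataFrame (path is hypothetical)
df = spark.read.parquet("dbfs:/mnt/my-container/example.parquet")

# Quick checks that the read worked
df.printSchema()
df.show(5)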

 

Asteroid

Thanks @AndrewKramer!

 

I used the code below and got a DBFS error: "DBFS path must be specified".

 

spark.read.parquet("dbfs/user/hive/warehouse/taxi_perm_table/part-00000-tid-347647642471467172-70784db0-62ea-45e9-9a39-fa9b1e196094-163-1-c000.snappy.parquet")

 

 

Isn't that my entire path?
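Side note: the usual cause of that error is the path format rather than the path itself. On Databricks, spark.read.parquet generally expects the dbfs:/ scheme (or a leading slash), and it can read the whole table directory at once. A hedged sketch reusing the table directory from the path above:

# Use the dbfs:/ scheme (or a leading "/") and point Spark at the table directory;
# Spark will pick up every part file inside it
df = spark.read.parquet("dbfs:/user/hive/warehouse/taxi_perm_table/")
df.show(5)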

 

Edit: I decided to create CSV files instead. The Connect In-DB tool lets me select one CSV file, but Databricks splits each file into multiple part files once I write it to a table. What's the most efficient way to read all of those files in at once (and then query the original file, which I assume was split into smaller parts)?
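Side note: two common patterns for that are sketched below, assuming the part files sit in a single table directory on DBFS; the directory and table names are hypothetical, not taken from this thread.

# Option 1: read every CSV part file in the table's directory in one go
# (header/inferSchema depend on how the files were written)
df = spark.read.csv("dbfs:/user/hive/warehouse/my_csv_table/", header=True, inferSchema=True)

# Option 2: query the registered Databricks table and let Spark resolve the part files
df = spark.table("my_csv_table")
df.show(5)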

 

[screenshot: Luisa_Bez_0-1580166361766.png]

 

 

Alteryx

Are you just trying to read the Databricks table into Alteryx?

 

If so, I would recommend using the ODBC Driver for Databricks (Simba Spark ODBC Driver):

https://pages.alteryx.com/Alteryx-Driver-Downloads-LP.html

 

In my experience, this is much faster than the Spark In-Database Connector for Azure Databricks. This driver supports reading the partitioned Databricks tables that you have.

Asteroid

Thanks for your help.

 

Yes, I'm trying to read a Databricks table into Alteryx. 

 

I have downloaded and installed the driver, but where do I go from there? The only way I managed to connect to Databricks was through Apache Spark on Databricks with an in-DB tool. 

 

[screenshot: Luisa_Bez_0-1580204501260.png]

Alteryx

In your ODBC Manager, you'll need to configure the Simba Spark ODBC Driver to create a DSN.

[screenshot: AndrewKramer_0-1580223544986.png]

 

For Databricks, the user name is 'token' and your password is your API token.

 

From there, with the Input Data tool, click on the Spark ODBC connection:

[screenshot: AndrewKramer_2-1580223673048.png]

 

Once you click on the connection, select your DSN:

[screenshot: AndrewKramer_3-1580223726706.png]
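Side note: if you want to sanity-check the DSN outside Alteryx, a short pyodbc query works; the DSN name is hypothetical, the table name comes from earlier in this thread, and autocommit is enabled because Spark/Hive ODBC connections typically don't support transactions.

import pyodbc

# Connect through the DSN configured above; the user name is literally "token"
# and the password is your Databricks personal access token (placeholder below)
conn = pyodbc.connect("DSN=Databricks;UID=token;PWD=<your-api-token>", autocommit=True)

# Run a small query to confirm the driver and credentials work
cursor = conn.cursor()
cursor.execute("SELECT * FROM taxi_perm_table LIMIT 5")
for row in cursor.fetchall():
    print(row)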

Asteroid

Hi. Great guidance so far, thanks! I'm trying to do something similar and was able to create the ODBC connection and test it. However, do you know what I need to put in the "Databricks URL" field when creating the "write driver" in the Alteryx In-DB connection? Unfortunately, you seem to need to set up the Write tab even when you only want to read.

 

[screenshot: coderockride_1-1580314357827.png]

[screenshot: coderockride_4-1580314553920.png]

[screenshot: coderockride_2-1580314479737.png]

Alteryx

The Databricks URL is the host that you specified when configuring the ODBC Driver.

 

For you, this appears to be

uksouth.azuredatabricks.net
Asteroid

(Facepalm.) It's working now. It was actually an SSL issue I had to resolve in the end, but it was comforting to know which issue I needed to focus on.
