Alteryx Designer Knowledge Base

Apache Spark Deep Dive

Alteryx Alumni (Retired)
Alteryx can connect to Apache Spark and use it as the database engine for multiple data sources. This article describes the available connection methods and how they differ from each other.

Apache Spark Direct:
This connection type communicates with Apache Spark using Livy, a service that exposes Spark over a REST interface. Because this connection type requires Livy, be sure to discuss with your admin which configuration your Hadoop instance supports.
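To make the REST interaction concrete, here is a minimal Python sketch of the two requests a Livy client typically issues: one to start an interactive session and one to submit code to it. It is an illustration of Livy's public REST endpoints, not a description of what Alteryx does internally. The host name `livy.example.com` is a placeholder, and the helper names are hypothetical; port 8998 is Livy's default.

```python
import json
from urllib.request import Request

# Hypothetical Livy endpoint; replace with the URL your admin provides.
LIVY_URL = "http://livy.example.com:8998"


def build_session_request(base_url: str, kind: str = "pyspark") -> Request:
    """Build the POST /sessions request that asks Livy to start a session."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return Request(
        f"{base_url}/sessions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_statement_request(base_url: str, session_id: int, code: str) -> Request:
    """Build the POST /sessions/{id}/statements request that runs code in a session."""
    body = json.dumps({"code": code}).encode("utf-8")
    return Request(
        f"{base_url}/sessions/{session_id}/statements",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_session_request(LIVY_URL)
    print(req.get_method(), req.full_url)
    # Sending the request needs a reachable Livy server, for example:
    #   from urllib.request import urlopen
    #   with urlopen(req) as resp:
    #       session = json.load(resp)
```

The requests are built separately from sending them so the payloads can be inspected without a running cluster.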


This functionality is available within the following tools for Read and Write:

-Connect In-DB
-Data Stream In
-Apache Spark Code

When configuring connections inside these tools, you will have to enter the Livy information as well as the HDFS connection information. The connection configuration is the same across all of these tools, and for both read and write.


Apache Spark ODBC:
This connection type communicates with Apache Spark using ODBC. Alteryx sends commands to an ODBC driver, which then communicates with a Thrift server. The Thrift server then works with Spark to process data from HDFS. Using ODBC requires a Thrift server, so check with your admin whether one is currently running in your Hadoop instance.


This functionality is available within the following tools:

-Input Data
-Output Data
-Connect In-DB

Creating a read connection (the Input Data tool, or Read within an In-DB connection) or a write connection with the Output Data tool requires the user to set up a DSN using the Simba Spark ODBC driver. Information about the Spark instance and the Thrift server is entered in the DSN configuration.
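Once a DSN exists, any ODBC client can use it, which is a convenient way to check the DSN outside Alteryx. The sketch below assumes the third-party `pyodbc` package is installed and that a DSN has already been configured with the Simba Spark ODBC driver. The DSN name `SparkDSN` and the helper names are hypothetical.

```python
def build_connection_string(dsn: str, uid: str = "", pwd: str = "") -> str:
    """Build an ODBC connection string that refers to an existing DSN."""
    parts = [f"DSN={dsn}"]
    if uid:
        parts.append(f"UID={uid}")
    if pwd:
        parts.append(f"PWD={pwd}")
    return ";".join(parts)


def run_query(dsn: str, sql: str):
    """Run a query through the DSN and return all rows."""
    import pyodbc  # third-party; imported here so the helper above has no dependencies

    # autocommit=True avoids transaction calls, which Spark SQL endpoints
    # generally do not support.
    conn = pyodbc.connect(build_connection_string(dsn), autocommit=True)
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(build_connection_string("SparkDSN"))
    # With a working DSN: rows = run_query("SparkDSN", "SHOW TABLES")
```

If `run_query` fails here, the same DSN is unlikely to work from the Input Data or Output Data tools, so this can help separate driver or Thrift server problems from workflow problems.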

The Data Stream In tool (In-DB Write) uses bulk writing functionality to load files directly into HDFS. This is configured in the "write" section when creating an In-DB connection.
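As a rough illustration of what loading a file directly into HDFS can look like at the protocol level, the sketch below builds a WebHDFS `CREATE` URL. This assumes the cluster exposes WebHDFS; the article does not say which HDFS interface a given environment uses, so treat that as an assumption. The host, port, path, and user are placeholders, and the helper name is hypothetical.

```python
from urllib.parse import quote, urlencode


def build_webhdfs_create_url(
    host: str, port: int, hdfs_path: str, user: str, overwrite: bool = True
) -> str:
    """Build the WebHDFS URL for creating (uploading) a file at hdfs_path."""
    query = urlencode(
        {
            "op": "CREATE",
            "user.name": user,
            "overwrite": str(overwrite).lower(),
        }
    )
    # quote() keeps the slashes in the path and escapes anything unsafe.
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


if __name__ == "__main__":
    # Placeholder values; use the NameNode host and HTTP port for your cluster.
    print(build_webhdfs_create_url("namenode.example.com", 9870, "/tmp/data.csv", "alice"))
```

In WebHDFS, file creation is a two-step exchange: a `PUT` to this URL without a body returns a redirect to a DataNode, and the file contents are then sent with a second `PUT` to that redirected location.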

For some reason, the images are not showing up anymore.