Basic In-DB Configuration for Impala and Spark
Let’s look at a fairly typical Hadoop use case with the following sample workflow. Here, we are analyzing website activity by product category and customer age group using three datasets: clickstream data and registered-customer data dumped into Hadoop from transactional systems, plus a lookup flat file in which product categories are maintained manually.
To begin, you will first need to ensure that the associated ODBC drivers are obtained and installed. Check the Alteryx Technical Specifications page or the Alteryx Help for details. If you are having trouble with the connection, the first thing to check is the ‘bitness’ of the drivers: the bitness of the installed driver must match the bitness of the installed Alteryx version. (32-bit drivers will not establish an in-database connection on 64-bit Alteryx, and vice versa.) As with the regular Alteryx tools, you can define your ODBC DSN using the ODBC Administrator; the sample DSNs that come with the Simba drivers for Impala and Spark are handy as a reference.
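If you want a quick sanity check outside of Alteryx, a few lines of Python with pyodbc will confirm the DSN works; this is only a sketch, and the DSN name ‘Impala-DSN’ is a hypothetical example.

    # Minimal DSN sanity check (illustrative; "Impala-DSN" is a
    # hypothetical DSN defined in the ODBC Administrator).
    import pyodbc

    try:
        conn = pyodbc.connect("DSN=Impala-DSN", autocommit=True)
        print("Connected to:", conn.getinfo(pyodbc.SQL_DBMS_NAME))
    except pyodbc.Error as err:
        # A 32/64-bit mismatch typically surfaces here as an
        # architecture-mismatch or driver-not-found error.
        print("Connection failed:", err)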
Next, you will want to set up your in-database connection in Alteryx. Alteryx version 10 introduces a new Alias Manager for In-DB connections, which can be accessed via the ‘Options’ menu as shown below:
In the sample workflow, the first two datasets, weblogs and registered users, are dumped into the Hadoop system and exposed as tables in the Hive Metastore. You can then use Impala or Spark SQL to access them (Alteryx does not support in-database Hive). To point to the relevant tables inside the data platform, in this case “Registered Users” and “Weblogs”, set up your connections using the In-DB Alias Manager as shown below:
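Before wiring up the aliases, you can confirm the driver actually sees the two tables with a quick query over the same DSN; the sketch below is illustrative, with a hypothetical DSN name and the table names from this example.

    # List the tables exposed through the Metastore (illustrative;
    # the DSN name is hypothetical, the tables come from this example).
    import pyodbc

    conn = pyodbc.connect("DSN=Impala-DSN", autocommit=True)
    for (table_name,) in conn.execute("SHOW TABLES"):
        print(table_name)  # expect registered_users and weblogs in the list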
The ‘read’ part of the connection always goes through either the Impala or the Spark ODBC driver; the ‘write’ part of the connection is used to load data into a permanent or temporary table. For Impala, the driver can write directly over the ODBC connection. Alteryx can also bulk-load the data: it writes an Avro file to HDFS, then creates the corresponding entry in the Hive Metastore so that the data becomes available as a table. For the bulk load to work, the Hadoop platform needs to have the WebHDFS or HttpFS service running. The screenshot below shows the configuration parameters.
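To make the mechanics concrete, here is a rough Python sketch of the same two steps the bulk loader performs, written against the WebHDFS REST API; every host, port, path, and user name in it is a hypothetical placeholder, and the Avro schema details are simplified.

    # Rough sketch of an Avro bulk load over WebHDFS (all hosts, ports,
    # paths, user names, and the schema URL are hypothetical).
    import pyodbc
    import requests

    NAMENODE = "http://namenode.example.com:50070"   # WebHDFS/HttpFS endpoint
    HDFS_PATH = "/tmp/alteryx/weblogs.avro"

    # Step 1: WebHDFS CREATE is a two-step protocol; the NameNode replies
    # with a 307 redirect pointing at the DataNode that will take the data.
    resp = requests.put(
        NAMENODE + "/webhdfs/v1" + HDFS_PATH
        + "?op=CREATE&user.name=alteryx&overwrite=true",
        allow_redirects=False,
    )
    datanode_url = resp.headers["Location"]

    # Step 2: stream the file content to the DataNode location.
    with open("weblogs.avro", "rb") as f:
        requests.put(datanode_url, data=f)

    # Finally, register the file in the Hive Metastore so it shows up
    # as a queryable table.
    conn = pyodbc.connect("DSN=Impala-DSN", autocommit=True)
    conn.execute(
        "CREATE EXTERNAL TABLE weblogs_staged STORED AS AVRO "
        "LOCATION '/tmp/alteryx' "
        "TBLPROPERTIES ('avro.schema.url'='/schemas/weblogs.avsc')"
    )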
For Spark, the ODBC driver currently available is read-only; as the driver evolves to allow writes, Alteryx will support that capability when it becomes available. At the moment only the ‘bulk loader’ is available, and it uses the same principle described above for Impala. The configuration is slightly different: the Name Node URI is required in addition to the other parameters, and the URI must include the port number.
Note: in Alteryx Analytics v10.0, Kerberos is supported with the ODBC drivers but not yet with the HDFS bulk loaders (support is in progress). If you need to write data to a Kerberos-secured Hadoop platform, you will want to use the Impala ODBC driver.
Once the connection is established, the in-database tools work exactly like the other Alteryx tools. A full list of Alteryx’s in-database tools and their descriptions can be found in the Alteryx in-database online help.
For in-database blending, Alteryx converts the workflow into a SQL statement that is passed to the platform via the ODBC driver. This design lets Alteryx fully leverage the power and ongoing improvements of the platform’s SQL optimizer. Details are available in this knowledge article.
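As a hypothetical illustration (not the exact SQL Alteryx generates), an In-DB stream of Connect, Filter, and Summarize tools could be pushed down as a single consolidated statement, so only the small aggregated result travels back to the client:

    # Hypothetical illustration of pushdown: a Filter followed by a
    # Summarize runs as one consolidated statement on the platform
    # (the DSN and column names are made-up placeholders).
    import pyodbc

    conn = pyodbc.connect("DSN=Impala-DSN", autocommit=True)
    rows = conn.execute("""
        SELECT product_category,
               COUNT(*) AS visits
        FROM weblogs
        WHERE visit_date >= '2015-01-01'
        GROUP BY product_category
    """).fetchall()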
As shown below, from a recent webinar by Michael Armbrust of Databricks, DataFrames with Spark SQL are equally efficient whether they are manipulated from R, Python, or Scala.
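To give a flavor of that point, here is a minimal PySpark sketch (using the Spark 1.x API contemporary with this post; the table and column names are the ones from this example): the DataFrame call and the equivalent SQL compile to the same optimized plan, which is why every language binding performs alike.

    # Minimal PySpark sketch (Spark 1.x API). The DataFrame expression
    # and the SQL string compile to the same optimized execution plan.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="indb-demo")
    sqlContext = SQLContext(sc)

    by_category = sqlContext.table("weblogs").groupBy("product_category").count()

    same_plan = sqlContext.sql(
        "SELECT product_category, COUNT(*) FROM weblogs GROUP BY product_category")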
Advanced Analytics with In-Database
While almost all of the Alteryx in-database tools work on a drag, drop, and configure basis, the Formula In-DB tool comes in handy when more advanced capabilities are needed.
This tool accepts a SQL expression that can be as simple or as complex as needed to achieve the desired outcome. For example, the above workflow uses the following Impala SQL expressions:
substr(`cookie`, 2, length(`cookie`)-2) to extract an ID from the cookie.
floor(datediff(now(),cast(unix_timestamp(`birth_dt`, 'dd-MM-yyyy') as TIMESTAMP))/365.25) to calculate the approximate age of the customer.
Check the SQL function reference for your platform: Impala’s documentation, or the HiveQL documentation for Spark SQL, which should accept most HiveQL functions.
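Put together, the two expressions above could run as part of a single statement; the sketch below is illustrative, with a hypothetical DSN and output column names, while `cookie` and `birth_dt` come from the examples above.

    # Both Formula In-DB expressions combined into one query (illustrative;
    # the DSN and alias names are hypothetical, while `cookie` and
    # `birth_dt` are the columns used in the examples above).
    import pyodbc

    conn = pyodbc.connect("DSN=Impala-DSN", autocommit=True)
    rows = conn.execute("""
        SELECT substr(`cookie`, 2, length(`cookie`) - 2) AS visitor_id,
               floor(datediff(now(),
                     cast(unix_timestamp(`birth_dt`, 'dd-MM-yyyy') AS TIMESTAMP))
                     / 365.25) AS age
        FROM registered_users
    """).fetchall()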
In addition to broadening in-database platform support, Alteryx v10.0 also delivers in-database advanced analytics in the form of Linear Regression, Logistic Regression, and Scoring for Oracle R Enterprise, with more tools to come.
In parallel, Alteryx has invested in SparkR as a way to provide an advanced analytics engine on Hadoop platforms. In the coming months, as new versions of Spark are released, Alteryx will be able to deliver in-database predictive tools for Hadoop based on SparkR.