
Engine Works Blog

Under the hood of Alteryx: tips, tricks and how-to's.
Alteryx
Alteryx


 

Welcome to a new Engine Works blog series, where I'll be exploring various data sources, data types, and modeling techniques, and answering the age-old question, "Will it Alteryx?"  

 

I'll start this series with Presto, a technology that many customers have been asking about recently. Presto is a distributed SQL query engine that can work with many data source systems. It uses an MPP (massively parallel processing) design, where SQL queries are processed in parallel across a cluster of machines. This may sound identical to other MPP-style data sources like Hadoop or Teradata; however, the major difference is that Presto does not store the data itself. In fact, Presto is much more flexible and pluggable, as it can be configured to query multiple back-end source systems through "connectors." Many organizations are adopting Presto as a single engine to query against all available data sources.

 

[Image: Presto can connect to many data source and BI/Reporting systems]

 

Environment Setup

 

To find out whether Alteryx can work with Presto, I needed an environment to test. Presto follows a familiar architecture with one master and multiple workers. In Presto terms, the master is called the "Coordinator" and is responsible for handling client connections and distributing work to the workers. The workers retrieve data from the source systems defined by the connectors and work together to process the requests directed by the Coordinator.
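For reference, each node's role is set in its etc/config.properties file. A minimal sketch of what the Coordinator and a Worker might look like (the port, discovery hostname, and layout here are illustrative placeholders, not the values from my cluster):

```
# etc/config.properties on the Coordinator
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://presto-coordinator.example.com:8080

# etc/config.properties on each Worker
coordinator=false
http-server.http.port=8080
discovery.uri=http://presto-coordinator.example.com:8080
```

The discovery.uri on every Worker points back at the Coordinator, which is how the nodes find each other and form the cluster.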

 

This fits perfectly with the Hadoop architecture, so I colocated the Presto Coordinator on a Hadoop NameNode and configured Presto Workers on three Hadoop DataNodes. After configuring the Workers and Coordinator to talk to each other, I was able to validate my new Presto cluster using the web interface:

 

[Image: Presto Web UI]

 

Connectors and Catalogs and Schemas! Oh My!

 

Presto introduces a couple of new terms, alongside the familiar database schema, that define how to connect to data sources. A "connector" defines how Presto connects to the back-end data source system, including the server hostname, port, and credentials. A "catalog" contains schema information and references a specific connector. All of these are easily configured in properties files, and in no time at all I had connections to Hive, MongoDB, PostgreSQL, Amazon Redshift, and Microsoft SQL Server:

 

[root@presto-coordinator catalog]# ls -alh
...
-rw-r--r-- 1 root root 100 May 22 15:41 hive.properties
-rw-r--r-- 1 root root 82 May 22 14:38 mongodb.properties
-rw-r--r-- 1 root root 133 May 22 14:25 postgresql.properties
-rw-r--r-- 1 root root 131 May 22 15:19 redshift.properties
-rw-r--r-- 1 root root 146 May 22 14:48 sqlserver.properties
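Each of those files is only a few lines long. As a sketch, a Hive catalog and a PostgreSQL catalog might look like the following (the connector.name values are standard Presto connector names, but the hostnames and credentials are placeholders):

```
# hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://namenode.example.com:9083

# postgresql.properties
connector.name=postgresql
connection-url=jdbc:postgresql://pg.example.com:5432/mydb
connection-user=presto
connection-password=changeme
```

Dropping a new file into the catalog directory is all it takes to expose another back-end system through Presto.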

 

Querying Your Data

 

Presto includes a command line tool for running queries against specific data sources that have been configured in your Presto cluster.  It can be launched against a catalog and schema to allow users to run queries interactively:

 

[root@presto-coordinator presto]# ./presto --catalog hive --schema default
presto:default> show tables;
            Table
--------------------------------
acmebank_stress_test
canada_transactions
cdh_customers
inspire_campaign_scored
(4 rows)

Query 20190529_124127_00007_k5hzu, FINISHED, 4 nodes
Splits: 53 total, 53 done (100.00%)
0:00 [4 rows, 513B] [71 rows/s, 2.57KB/s]

presto:default>

 

Or, the query can be specified directly on the command line:

 

[root@presto-coordinator presto]# ./presto --catalog hive --schema default --execute "SELECT * FROM acmebank_stress_test WHERE loan_type='Term loan' AND collateral>=100000;"

"2008-06-30 00:00:00.000","100012891","36","32","BB","Pool5","1952021.0","Term loan","","1764250.0","112457.708875471","506059.689939621","-13899.7728170082","-13775.9730136383","-110529.743351998","","","","1764250.0","0.007967812730301","0.0","0","253029.84496981","197363.279076452","168686.563313207","0.151388441875723","267087.058579244","154629.349703773","0.087645940033313","0.26293782009994","510786.287443657","506059.689939621","0.990746428359126","0.15","47","2","WA","West","Retail Lending","9","Mortgage","Consumer","Bank B","Collateral"
...
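Because every table is addressable as catalog.schema.table, a single statement can even join data across back ends. A hypothetical sketch (the PostgreSQL table and the join column are invented for illustration):

```
[root@presto-coordinator presto]# ./presto --execute "
SELECT c.customer_id, SUM(t.amount) AS total
FROM hive.default.cdh_customers c
JOIN postgresql.public.transactions t
  ON c.customer_id = t.customer_id
GROUP BY c.customer_id;"
```

This federated-query capability is the main reason organizations adopt Presto as a single query engine over many sources.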

 

Will it Alteryx?

 

With an environment up and running, it's time to answer the question on everyone's mind: "Presto, will it Alteryx?" The answer is yes, it will! Although not listed as a Supported Alteryx Data Source, Presto data can be read in through the standard Input tool using a Generic ODBC connection. To support this, the machine where Alteryx is running (server, laptop, or desktop) needs the Presto ODBC driver installed.

 

Additionally, an ODBC configuration must exist for each data source you intend to query.  The Presto Coordinator hostname and port must be specified, along with any credentials required to make the connection:

 

[Image: ODBC Driver Setup]

 

Catalog and Schema selections allow you to specify the intended data source to connect to:


[Image: ODBC Driver Setup - Catalog selection]
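The same settings can also be assembled programmatically, which is handy for scripting connections outside of Designer. A sketch in Python; the driver name and keyword names are assumptions based on a typical Presto ODBC driver install, so check your own driver's documentation:

```python
# Sketch: build a keyword-style ODBC connection string for a Presto DSN-less
# connection. "Simba Presto ODBC Driver" and the Host/Port/Catalog/Schema
# keywords are assumptions -- verify them against your installed driver.
def presto_conn_str(host, port, catalog, schema,
                    driver="Simba Presto ODBC Driver"):
    """Assemble an ODBC connection string for the Presto driver."""
    return (f"DRIVER={{{driver}}};Host={host};Port={port};"
            f"Catalog={catalog};Schema={schema}")

# With the driver and pyodbc installed, the string could be used like:
#   import pyodbc
#   conn = pyodbc.connect(presto_conn_str("presto-coordinator", 8080,
#                                         "hive", "default"))
print(presto_conn_str("presto-coordinator", 8080, "hive", "default"))
```

Note that the catalog and schema in the connection string play the same role as the Catalog and Schema selections in the driver's setup dialog.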

Once the ODBC Driver has been configured, the Input tool can be easily configured to pull data from Presto into Alteryx Designer for processing:

 

[Image: Reading in data from Hive through Presto]

 

While this blog is not meant to provide performance details, I did observe consistently faster results reading a test table through Presto (~12 seconds) than directly from Hive (14 seconds) or Impala (24 seconds). Additionally, I was able to use the In-Database tools with Presto by specifying a Generic ODBC Connection, and the processing completed in under 3 seconds!

 

[Image: In-Database workflow using Presto]

 

Final Thoughts

 

Alteryx working with Presto opens up many opportunities for data analysis as Presto continues to add connectors to more data sources. Although Presto is not currently listed as a supported data source for the standard Input tool or the In-Database tools, it can still be queried using a generic ODBC connection. With Presto's growing popularity, it is likely to be considered as a future addition to the supported data sources list.

 

If you have any technologies you would like to see explored in future installments of the "Will it Alteryx" series, please leave a comment below!  

 

Fun Facts

  1. "Presto" is Italian for "quickly."
  2. Presto was originally developed at Facebook to provide faster queries against data in Hive. 

 


David Hare
Solutions Architect

David works as a Solutions Architect helping customers understand the Alteryx platform, how it integrates with their existing IT infrastructure, and how Alteryx can provide high performance and advanced analytics. He's passionate about learning new technologies and recognizing how they can be leveraged to solve organizations' business problems.


Comments
Alteryx Certified Partner

I'd like to see a more thorough look at Snowflake.
