Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Alteryx Designer Desktop Knowledge Base

Definitive answers from Designer Desktop experts.

Connecting to Hadoop

HenrietteH
Alteryx
Alteryx
Created

Connecting to Hadoop

 

In this article, we will go over how to connect to the various flavors of Hadoop in Alteryx.

 

To use a Saved Data Connection to connect to a database, use the "Saved Data Connections" option in the Input Data Tool and then navigate to the connection you wish to use:

 

2-14-2017 2-56-58 PM.png

 

Note: Alteryx versions ≥ 11.0 and include functionality to manage data connections through the Gallery as well as through Options > Advanced Options > Manage Data Connections. Aliases created in prior versions will be carried forward as connections stored on the local computer. 

  

 Creating a new connection to HDFS

 

1. Bring an Input Tool to the canvas and navigate to the Hadoop option.

Select the little arrow under "Connect a File or Database", then Hadoop. 

 

2-14-2017 3-03-38 PM.png 

 

2. Enter the necessary information to connect

Your HDFS/Hadoop Admin/IT contact will need to provide the hostname to use as well as the connection method (HTTPFS, WebHDFS, or Knox Gateway), and the username and password (if needed). Alteryx provides the default port. If your company uses a different port, you will need to edit this here:

 

2-14-2017 3-04-41 PM.png

Alteryx uses the host and port to generate the URL to be used for the connection. You also have the option to edit the URL should this be required. 

Clicking "Test" will test the connection to make sure it works.

 

3.  Click OK and select the file to be read in

Clicking "OK" will bring up a new window with all files available in HDFS (Note: Alteryx currently supports .csv and .avro files in HDFS). You can either enter the file name manually or use the browse functionality to browse to the file.

2-14-2017 3-18-16 PM.png

Creating a Saved Data Connection for HDFS

Once you have successfully established a connection to HDFS, you should save the connection for future use. The easiest way to do that is as follows: 

 

1. Copy the connection string now visible in the Input Tool 

2-14-2017 3-24-04 PM.png

  

2. Open the Data Connections Manager

Go to Options > Advanced Options > Manage Data Connections and select Add Connection > Other

2-14-2017 3-27-31 PM.png

  

3. Enter a connection name and connection string and hit save

2-14-2017 3-32-25 PM.png

  

4. The HDFS connection will now be available in both Input and Output Tools to use under Saved Data Connections

   

Creating a new connection to Hive/Impala/Spark

To connect to Hive/Impala/Spark an ODBC driver is required which can be downloaded from the Alteryx drivers download page

Once the driver has been downloaded, a new DSN needs to be set up through the Windows ODBC Data Sources Administrator (Control Panel > System and Security > Administrative Tools > Data Sources (ODBC)). Your HDFS/Hadoop Admin/IT contact might need to provide necessary hostnames/user credentials/etc. to connect. 

Once a DSN has been set up and tested through the Data Source Administrator, you are ready to connect through Alteryx. 

 

1. Add a new Input Tool to the canvas and browse to the ODBC connections.

Select the little drop down as you did for the HDFS connection, then go to Other Databases > ODBC2-14-2017 3-46-16 PM.png

  

2. Select the Data Source Name

In the ODBC connection window, select the DSN that was set up through the Windows ODBC Data Sources Administrator and enter username/password as needed

2-14-2017 3-56-37 PM.png 

 3. Enter/Build your query

 Use the Tables, Visual Query Builder, or SQL Editor tabs to create the query reading in the data from Hadoop to Alteryx. 

Note: As a new feature in Alteryx 11 the Visual Query Builder does not load automatically for all databases because it can be slow to load (e.g. large schemas in Hive). Click the "Load" button to load tables into the Visual Query Builder. 

If the Visual Query Builder is blank after clicking the "Load" button, try checking "Get Tables with Query" in the Advanced options in your ODBC DSN. With the option checked, the driver uses the SHOW TABLES query to retrieve table names. Without it checked, the driver uses the GetTables Thrift API call to retrieve table names.

2-14-2017 4-12-59 PM.png 

Creating a Saved Data Connection for ODBC connections to Hive/Impala/Spark

 

1. Open the Data Connections Manager

Go to Options > Advanced Options > Manage Data Connections and select Add Connection > Other

2-14-2017 3-27-31 PM.png

 

 

2. Enter a connection name and select ODBC as the connection type

2-14-2017 4-33-36 PM.png

 

 

3. Select the Data Source Name

In the ODBC connection window, select the DSN that was set up through the Windows ODBC Data Sources Administrator and enter username/password as needed

2-14-2017 4-37-22 PM.png

 

 

4. The connection will now be available in both Input and Output tools to use under Saved Data Connections

Comments
aser
7 - Meteor

Thank you for this tutorial.

It works fine for me in the intranet but if I connect from AWS Cloud I get an error.

Is there a way to access the services via a proxy because I cant reach the intranet?

 

The internet I reached from the aws only about a proxy.

 

The error message.

Error: Output Data (7): Failed to retrieve upload redirect URL (myserver.de:50070

 

<div id="titles">
<h1>ERROR</h1>
<h2>The requested URL could not be retrieved</h2>
</div>
<hr>

<div id="content">
<p>The following error was encountered while trying to retrieve the URL: <a href="http://myserver.de:50070/webhdfs/v1/table?">http://myserver:50070/webhdfs/v1/table?</a></p>

<blockquote id="error">
<p><b>Unable to determine IP address from host name <q>myserver.de</q></b></p>
</blockquote>

<p>The DNS server returned:</p>
<blockquote id="data">
<pre>Name Error: The domain name does not exist.</pre>
</blockquote>

<p>This means that the cache was not able to resolve the hostname presented in the URL. Check if the address is correct.</p>

abdj
5 - Atom

Hi,

I was able to connect Alteryx with my Hive tables. The HIVE tables are on Hortonworks 2.4 hadoop environment and Alteryx is in my local desktop system. I am using-

ODBC driver

Keytab file for authentication and

MIT kerberos

 

The problem is that the speed is very low. It is taking almost 1.5 hours to load data of around 2 GB in Alteryx from HIVE.

I also connected to corresponding HDFS file as detailed above by Henrietteh, but the speed issue is still there.

Can Henrietteh or someone help with this issue. Thank You.

archanbhatt
7 - Meteor

We also have a similar problem as @abdj where we can connect to Hive but data loading takes a lot of time. Query builder is also not working.

We have "FastSQLPrepare" property set in our ODBC connection. Do you recommend any specific version of ODBC? 

FYI: I am using Alteryx 11.0.5 and Hive 1.2 and ODBC driver 2.0.5.10005(64 bit)

 

Thanks in advance,

AB

eliasg
7 - Meteor
@archanbhatt were you able to get past this issue? im having the same exact problem...
archanbhatt
7 - Meteor

Nope @eliasg, It works but it is very very slow. I have no idea. It seems a problem in ODBC driver since it works fine on sql server.

Diogenes
5 - Atom

Hi

@aser , Did you fix it?