on 02-15-2017 04:44 PM - edited on 05-21-2019 08:06 AM by SydneyF
In this article, we will go over how to connect to the various flavors of Hadoop in Alteryx.
To use a Saved Data Connection to connect to a database, use the "Saved Data Connections" option in the Input Data Tool and then navigate to the connection you wish to use:
Note: Alteryx versions ≥ 11.0 and include functionality to manage data connections through the Gallery as well as through Options > Advanced Options > Manage Data Connections. Aliases created in prior versions will be carried forward as connections stored on the local computer.
1. Bring an Input Tool to the canvas and navigate to the Hadoop option.
Select the little arrow under "Connect a File or Database", then Hadoop.
2. Enter the necessary information to connect
Your HDFS/Hadoop Admin/IT contact will need to provide the hostname to use as well as the connection method (HTTPFS, WebHDFS, or Knox Gateway), and the username and password (if needed). Alteryx provides the default port. If your company uses a different port, you will need to edit this here:
Alteryx uses the host and port to generate the URL to be used for the connection. You also have the option to edit the URL should this be required.
Clicking "Test" will test the connection to make sure it works.
3. Click OK and select the file to be read in
Clicking "OK" will bring up a new window with all files available in HDFS (Note: Alteryx currently supports .csv and .avro files in HDFS). You can either enter the file name manually or use the browse functionality to browse to the file.
Once you have successfully established a connection to HDFS, you should save the connection for future use. The easiest way to do that is as follows:
1. Copy the connection string now visible in the Input Tool
2. Open the Data Connections Manager
Go to Options > Advanced Options > Manage Data Connections and select Add Connection > Other
3. Enter a connection name and connection string and hit save
4. The HDFS connection will now be available in both Input and Output Tools to use under Saved Data Connections
To connect to Hive/Impala/Spark an ODBC driver is required which can be downloaded from the Alteryx drivers download page.
Once the driver has been downloaded, a new DSN needs to be set up through the Windows ODBC Data Sources Administrator (Control Panel > System and Security > Administrative Tools > Data Sources (ODBC)). Your HDFS/Hadoop Admin/IT contact might need to provide necessary hostnames/user credentials/etc. to connect.
Once a DSN has been set up and tested through the Data Source Administrator, you are ready to connect through Alteryx.
1. Add a new Input Tool to the canvas and browse to the ODBC connections.
Select the little drop down as you did for the HDFS connection, then go to Other Databases > ODBC
2. Select the Data Source Name
In the ODBC connection window, select the DSN that was set up through the Windows ODBC Data Sources Administrator and enter username/password as needed
3. Enter/Build your query
Use the Tables, Visual Query Builder, or SQL Editor tabs to create the query reading in the data from Hadoop to Alteryx.
Note: As a new feature in Alteryx 11 the Visual Query Builder does not load automatically for all databases because it can be slow to load (e.g. large schemas in Hive). Click the "Load" button to load tables into the Visual Query Builder.
If the Visual Query Builder is blank after clicking the "Load" button, try checking "Get Tables with Query" in the Advanced options in your ODBC DSN. With the option checked, the driver uses the SHOW TABLES query to retrieve table names. Without it checked, the driver uses the GetTables Thrift API call to retrieve table names.
1. Open the Data Connections Manager
Go to Options > Advanced Options > Manage Data Connections and select Add Connection > Other
2. Enter a connection name and select ODBC as the connection type
3. Select the Data Source Name
In the ODBC connection window, select the DSN that was set up through the Windows ODBC Data Sources Administrator and enter username/password as needed
4. The connection will now be available in both Input and Output tools to use under Saved Data Connections
Thank you for this tutorial.
It works fine for me in the intranet but if I connect from AWS Cloud I get an error.
Is there a way to access the services via a proxy because I cant reach the intranet?
The internet I reached from the aws only about a proxy.
The error message.
Error: Output Data (7): Failed to retrieve upload redirect URL (myserver.de:50070
<div id="titles">
<h1>ERROR</h1>
<h2>The requested URL could not be retrieved</h2>
</div>
<hr>
<div id="content">
<p>The following error was encountered while trying to retrieve the URL: <a href="http://myserver.de:50070/webhdfs/v1/table?">http://myserver:50070/webhdfs/v1/table?</a></p>
<blockquote id="error">
<p><b>Unable to determine IP address from host name <q>myserver.de</q></b></p>
</blockquote>
<p>The DNS server returned:</p>
<blockquote id="data">
<pre>Name Error: The domain name does not exist.</pre>
</blockquote>
<p>This means that the cache was not able to resolve the hostname presented in the URL. Check if the address is correct.</p>
Hi,
I was able to connect Alteryx with my Hive tables. The HIVE tables are on Hortonworks 2.4 hadoop environment and Alteryx is in my local desktop system. I am using-
ODBC driver
Keytab file for authentication and
MIT kerberos
The problem is that the speed is very low. It is taking almost 1.5 hours to load data of around 2 GB in Alteryx from HIVE.
I also connected to corresponding HDFS file as detailed above by Henrietteh, but the speed issue is still there.
Can Henrietteh or someone help with this issue. Thank You.
We also have a similar problem as @abdj where we can connect to Hive but data loading takes a lot of time. Query builder is also not working.
We have "FastSQLPrepare" property set in our ODBC connection. Do you recommend any specific version of ODBC?
FYI: I am using Alteryx 11.0.5 and Hive 1.2 and ODBC driver 2.0.5.10005(64 bit)
Thanks in advance,
AB
Nope @eliasg, It works but it is very very slow. I have no idea. It seems a problem in ODBC driver since it works fine on sql server.
Hi
@aser , Did you fix it?