
Alteryx Designer Desktop Knowledge Base

Definitive answers from Designer Desktop experts.

How To: Write to Hive Faster

HenrietteH
Alteryx

Hive ODBC can be slow when loading data into tables. If you are looking for a faster way to write to Hive and want to create a new table or overwrite an existing one, use the In-DB tools to output your data.

Prerequisites

  • Alteryx Designer
  • Windows Operating System
  • A working ODBC DSN for Hive
  • Direct write access to HDFS

Procedure

An In-DB connection is required to use this option. To create the connection:

1. Open the Manage In-DB Connections window either by going to Options > Advanced Options > Manage In-DB Connections or by dragging a Connect In-DB tool onto the canvas and selecting Manage Connections in the drop-down.

2. Select Hive as the Data Source.

3. Click "New" to create a new connection or select an existing connection to edit

4. For a new connection, enter a Connection Name

5. On the Read tab, select the ODBC DSN to be used.

6. On the Write tab, select HDFS (Avro) as the Driver.

7. Click the drop-down and select New HDFS connection... to create the connection string, or select an existing HDFS connection if any are displayed.

- It is helpful to first test the connection in a regular Input Data tool to make sure it works.

- The information to enter into the window can be obtained from the DB admin/IT team.

- The Temp Directory needs to be filled in for Alteryx to be able to write out a temporary Avro file. By default the value is /temp; it can be changed to any directory the user has access to.


8. To write a new table in Alteryx, add a Data Stream In tool to the canvas.

9. Alteryx will perform three steps when writing using this option (see the sketch after this list):

1. A table will be created using a CREATE TABLE statement and the information provided in the ODBC DSN. Data types and column headers are based on the data going into the tool.

2. An Avro file will be dropped into the temporary location specified in the HDFS connection set up on the Write tab of the In-DB connection.

3. A LOAD DATA INPATH statement will be run to move the data from the temporary location to the table's location.
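For illustration, the generated statements look roughly like the following HiveQL. This is a sketch only: the table name, columns, and temporary file name below are hypothetical, and the exact DDL Alteryx emits may differ.

    -- Step 1: create the table; columns and types are inferred
    -- from the data flowing into the Data Stream In tool
    CREATE TABLE sales_demo (
        order_id  INT,
        customer  STRING,
        amount    DOUBLE
    )
    STORED AS AVRO;

    -- Step 2 is not SQL: Alteryx writes a temporary Avro file,
    -- e.g. /temp/ayx_xxxx.avro, into the HDFS Temp Directory

    -- Step 3: move the staged file into the table's location
    LOAD DATA INPATH '/temp/ayx_xxxx.avro' INTO TABLE sales_demo;

Note that LOAD DATA INPATH moves the staged file rather than copying it, which is part of why this load path is fast and why the temporary file disappears from the Temp Directory once the load completes.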

NOTE: This option does not work for updates, only when creating new tables or overwriting existing ones.

The same process can be applied for writing to Impala.
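One caveat when mixing engines: Impala caches table metadata, so a table created or loaded through Hive may not be visible to Impala until the cache is refreshed. A minimal sketch, assuming a hypothetical table my_db.my_table:

    -- Make a table created or loaded outside Impala visible to Impala
    INVALIDATE METADATA my_db.my_table;

    -- If Impala already knows the table and only the data files changed
    REFRESH my_db.my_table;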


Comments
OJ_TU
5 - Atom
Henriette, when using Alteryx In-DB to write to Hive, it does not seem to use a comma- or tab-delimited file format as standard. It is using some kind of unspecified format that is creating downstream issues for me. Can I: 1) specify the table format or underlying file type when creating the table, or 2) alter the file format after creating it? Thanks.
HenrietteH
Alteryx

Hi @OJ_TU

We support CSV and Avro files for writing via HDFS. Selecting CSV will write out a comma-delimited file.
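If a different underlying file format is needed after the load, one common Hive pattern is a CREATE TABLE AS SELECT into the desired format. A sketch only, with hypothetical table names:

    -- Copy the Avro-backed table into a new table stored as ORC
    CREATE TABLE my_table_orc STORED AS ORC AS
    SELECT * FROM my_table_avro;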

PabloQ
5 - Atom

Hi @HenrietteH

In our case, we have an Excel file with 500k+ records and we need to append it into an Impala Parquet table.

In our flow we use the Data Stream In tool (to move the data from Excel into the In-DB tools) and an In-DB Output tool with ODBC. We are messing up the cluster with the tiny files created by ODBC.

I have a couple of questions about your approach:

  1. Can we use this Avro format with Impala tables, or just Hive ones?
  2. If we use Hive tables with this Avro format, we will still need to use the Data Stream In tool (because the Excel file is not in the In-DB tools). This step can take from 1 to 3 hours. Do you have any suggestion to improve that step?

Thanks in advance!
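On the tiny-files issue specifically, a common mitigation outside of Alteryx is to periodically compact the table by rewriting the data into a new table in a single pass. A sketch only, with hypothetical table names, and valid in both Hive and Impala:

    -- Rewrite the data in one pass so it lands in fewer, larger files
    CREATE TABLE my_table_compact STORED AS PARQUET AS
    SELECT * FROM my_parquet_table;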

 

mgrajkumar
7 - Meteor

There are some serious performance issues while writing to Hive in ORC format. Any solutions available?

Deepakchandrashekar

Hi,

Can you help me understand where I can find the host name and URL for the HDFS Avro connection?

HenrietteH
Alteryx

Hi @Deepakchandrashekar 

The URL is generated automatically based on the host name (although you can edit it if you need to). If you use a web UI to connect to HDFS to manage files, the host name should be the same; your Hadoop admin or your IT department should be able to confirm it or tell you what to use instead.
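For reference, the generated URL typically follows the WebHDFS REST pattern below. The host name here is hypothetical; 50070 is the default WebHDFS port on Hadoop 2.x, 9870 on Hadoop 3.x, and HttpFS setups usually listen on port 14000.

    http://namenode.example.com:50070/webhdfs/v1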