06-22-2017 04:26 PM - edited 07-21-2021 01:50 PM
How To: Write to Hive Faster
Hive ODBC can be slow when loading data to tables. If you are looking for a faster option to write to Hive and want to create a new table or overwrite an existing table, use the In-DB tools to output your data.
Prerequisites
An In-DB connection is needed to use this option.
Procedure
To create the connection:
1. Open the Manage In-DB Connections window either by going to Options > Advanced Options > Manage In-DB Connections or by dragging a Connect In-DB tool onto the canvas and selecting Manage Connections in the drop-down.
2. Select Hive as the Data Source.
3. Click "New" to create a new connection, or select an existing connection to edit.
4. For a new connection, enter a Connection Name.
5. On the Read tab, select the ODBC DSN to be used.
6. On the Write tab, select HDFS(Avro) as the Driver.
7. Click the drop-down and select New HDFS connection... to create the connection string, or select an existing HDFS connection if any are displayed.
- It is helpful to first test the connection in a regular input tool to make sure it works
- The information to enter into the window can be obtained from the DB Admin/IT team
- The Temp Directory needs to be filled in for Alteryx to be able to write out a temporary Avro file. By default the value is /temp. It can be changed to any directory the user has access to.
8. To write a new table in Alteryx, add a Data Stream In tool to the canvas.
9. Alteryx will perform three steps when writing using this option:
1. A table will be created using a CREATE TABLE statement and the information provided in the ODBC DSN. Data types and column headers are based on the data going into the tool.
2. An Avro file will be dropped into the temporary location specified in the HDFS connection set up on the Write tab of the In-DB connection.
3. A LOAD DATA INPATH statement will be run to move the data from the temporary location to the table's location.
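The three steps above can be sketched as the HiveQL statements Alteryx effectively issues. The table name, column types, and temp file path below are hypothetical placeholders (only the default /temp directory comes from the article); the snippet just assembles the statements as strings to show the general pattern, not Alteryx's exact SQL.

```python
# Sketch of the statement pattern used when writing via HDFS (Avro).
# Table name, columns, and the Avro file name are hypothetical examples.

def build_hive_write_statements(table, columns, temp_avro_path):
    """Return (CREATE TABLE, LOAD DATA INPATH) statements for an
    Avro-backed Hive load. `columns` maps column name -> Hive type."""
    cols = ", ".join(f"{name} {htype}" for name, htype in columns.items())
    create = f"CREATE TABLE {table} ({cols}) STORED AS AVRO"
    # LOAD DATA INPATH moves the file from the temporary location
    # into the table's location, which is why the temp file disappears.
    load = f"LOAD DATA INPATH '{temp_avro_path}' INTO TABLE {table}"
    return create, load

create_sql, load_sql = build_hive_write_statements(
    "sales",                        # hypothetical table name
    {"id": "INT", "amount": "DOUBLE"},
    "/temp/alteryx_stream.avro",    # default temp dir is /temp
)
print(create_sql)
print(load_sql)
```

Because the load is a metadata move rather than row-by-row inserts over ODBC, this path avoids the per-row overhead that makes the plain ODBC write slow.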
NOTE: This option does not work for updates, only when creating new tables or overwriting existing ones.
The same process can be applied for writing to Impala.
Hi @OJ_TU
We support writing csv and Avro files via HDFS. Selecting csv will write out a comma-delimited file.
Hi @HenrietteH
In our case, we have an Excel file with 500k+ records and we need to append it into an Impala Parquet table.
In our flow we use a Data Stream In tool (to move the data from Excel to the In-DB tools) and an In-DB Output tool with ODBC. We are messing up the cluster with the tiny files created by ODBC.
I have a couple of questions about your approach:
Thanks in advance!
There are some serious performance issues while writing to Hive in ORC format. Any solutions available?
Hi,
Can you help me understand where I can find the host name and URL for the HDFS Avro connection?
The URL is generated automatically based on the host name (although you can edit it if you need to). If you use a Web UI to connect to HDFS to manage files, the host name should be the same but your Hadoop admin or your IT department should be able to confirm or tell you what to use instead.