Alteryx Designer Knowledge Base

Definitive answers from Designer experts.
Andy Uttley, Alteryx ACE, makes music with Alteryx | Math + Music

How To: Access Data in Parquet Format

Alteryx
Alteryx
Created

How To: Access Data in Parquet Format

 

Parquet is a columnar storage format used primarily in the Hadoop ecosystem. This differs from the traditional row oriented approach. The usage of a columnar storage format makes the data more homogeneous and thus allows for better compression. 

Data in Parquet format can be accessed via the Hive ODBC driver.

 

Prerequisites

 

  • Product - Alteryx Designer
  • 3rd party Product - Access to Hive database (version 0.13 or higher)
  • Hive ODBC driver

 

Saving data in Parquet format in Hive

  1. Create a table in your Hive database with "STORED AS PARQUET" for hive 0.13 and later. The create table syntax is explained in this article. See also below for it.
  2. CREATE TABLE parquet_test (
     id int,
     str string,
     mp MAP<STRING,STRING>,
     lst ARRAY<STRING>,
     strct STRUCT<A:STRING,B:STRING>) 
    PARTITIONED BY (part string)
    STORED AS PARQUET;

 

Please note that for this to work the correct permissions to create a table must have be set up.

 

Alternatively, find attached a solution example as Analytic App that uses the Post Create SQL Statement to alter the table to Parquet format:

ALTER TABLE table_name SET FILEFORMAT PARQUET;

 

Please note that if you want to use the attached example you will have to adjust the Connection String and input file name to your use case.

 

Access Data already stored in Parquet format

For data already stored in the Parquet format in HDFS, use the LOAD DATA statement to load the data in the HDFS file to a table in Hive.

Designer can read and write data from these tables via the Hive ODBC driver.

 

Writing data from Designer to new Hive table in Parquet

To write results of an Alteryx workflow back to a new Hive table in the Parquet format, use hive.default.fileformat=PARQUET in the Server Side Properties ODBC driver configuration under "Advanced Options". Please note that for this the database server needs to be accessed to make the change. Typically, DBA involvement is required. 

 

Additional Resources

Attachments