Engine Works Blog

Under the hood of Alteryx: tips, tricks and how-tos.

Welcome back to another exciting edition of "Will it Alteryx?" In this installment, I'll be looking at Parquet, a columnar storage format primarily used within the Hadoop ecosystem. While Parquet is growing in popularity and seeing use outside of Hadoop, it is most commonly used for column-oriented storage of files within HDFS, and sometimes as a storage format for Hive tables.

 

[Google Trends chart] Interest in Parquet has rapidly surpassed both ORC and Avro formats.

 

 

A column-oriented storage format organizes tables by column rather than by row. This allows much more efficient querying for applications that read specific values rather than entire records, and it enables additional benefits such as efficient encoding and compression. As an example, I took a 2 MB CSV file and converted it to a Parquet file that was almost 40% smaller.
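
To make that concrete, here is a minimal sketch of such a conversion using pandas with the pyarrow engine. The sample.csv file name and the printed savings are illustrative assumptions, not the exact files from my test:

  import os
  import pandas as pd

  df = pd.read_csv('sample.csv')                      # hypothetical 2 MB input file
  df.to_parquet('sample.parquet', engine='pyarrow')   # columnar, compressed output

  csv_size = os.path.getsize('sample.csv')
  pq_size = os.path.getsize('sample.parquet')
  print(f'CSV: {csv_size:,} B, Parquet: {pq_size:,} B '
        f'({1 - pq_size / csv_size:.0%} smaller)')

Your exact savings will vary with the data: columns with repeated or low-cardinality values compress especially well in a columnar layout.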

 

[Chart] The Parquet storage format typically provides significant savings in file size.

 

As more and more organizations move to the cloud, smaller files translate into immediate savings on storage costs. But I know what you're wondering: how can we leverage Parquet files in Hadoop with Alteryx when the HDFS Input only supports CSV and Avro file types?

 

[Screenshot] The HDFS File Selection tool only allows Avro or CSV file types.

 

Will it Alteryx?

 

There are a few ways to process Parquet data with Alteryx. This is not meant to be an exhaustive list, but it covers some of the most common methods.

  1. For Hive tables stored in Parquet format, a few options exist, which are covered in this Knowledge Base article.

  2. If you are running on a Hadoop client machine (like an edge node), you can read the data into a DataFrame using the Apache Spark Code tool or the Python tool in Designer.

    1. Example Spark code:
      df = sqlContext.read.parquet("/hdfs_path/file.parquet")
    2. Example Python code using the PyArrow package (host, port, and username are placeholders for your cluster's values):
      from ayx import Package
      Package.installPackages(['pyarrow'])                 # one-time install
      import pyarrow as pa
      import pyarrow.parquet as pq
      fs = pa.hdfs.connect(host, port, user=username)      # connect to HDFS
      df = pq.read_table('/hdfs_path/file.parquet', filesystem=fs).to_pandas()
  3. However, most of us aren't running on a Hadoop client machine, so the following solution allows you to read Parquet data from HDFS directly into Designer. This works via the WebHDFS or HttpFS RESTful interfaces to HDFS. Some Hadoop administrators might block this feature or make it accessible only through a Knox Gateway, so please work with your Hadoop administrator for details.

    1. In Designer, pull down the Python tool from the Developer category.   

    2. In the Python tool code editor, insert the following code, making modifications to match your environment. There are additional options for specifying credentials or a Kerberos token.
      from ayx import Package
      from ayx import Alteryx
      Package.installPackages(['wget'])     # one-time install; see the note below
      import pandas as pd
      import wget

      host = 'hdfs.namenode.com'            # your NameNode host
      port = '9870'                         # WebHDFS port (9870 by default; 50070 on Hadoop 2.x)
      file = '/mydir/myfile.parquet'        # the HDFS file path
      url = 'http://' + host + ':' + port + '/webhdfs/v1' + file + '?op=OPEN'
      local_file = wget.download(url)       # download the file to the local working directory
      df = pd.read_parquet(local_file)      # requires a Parquet engine such as pyarrow
      Alteryx.write(df, 1)                  # expose the DataFrame at output anchor 1
    3. The data set will be available at the "1" output anchor, where additional tools can be added to build out a workflow.

      [Screenshot] Reading Parquet data from HDFS through the Python tool
    4. Note: the "Package.installPackages" line requires Designer to be "Run as Administrator" and only needs to be executed one time; see the sketch after this list for a way to skip it on later runs.
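
On that last note, here is a minimal sketch (my own pattern, not from the original article) that attempts the import first and only installs the package when it is missing, so runs after the first one don't need elevated permissions:

  try:
      import wget                            # already installed: skip the install step
  except ImportError:
      from ayx import Package
      Package.installPackages(['wget'])      # one-time install; requires Run as Administrator
      import wget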

 

Final Thoughts

 

Hopefully this gives you some ideas on how you can use Alteryx to process Parquet data in your organization. If you have other ideas on how this can be accomplished, please add them to the comments below.

 

If you have any technologies you would like to see explored in future installments of the "Will it Alteryx" series, please leave a comment below!  

 


David Hare
Solutions Architect

David works as a Solutions Architect helping customers understand the Alteryx platform, how it integrates with their existing IT infrastructure, and how Alteryx can provide high performance and advanced analytics. He's passionate about learning new technologies and recognizing how they can be leveraged to solve organizations' business problems.


Comments
Asteroid

@DavidHa - thank you so much for sharing this article. Many of us have been trying to solve this for quite some time now.

In my workplace we are using Apache Spark on Databricks. Can you please let me know how I can access the Parquet files in this case?

 

Look forward to your response. Thank you!

 

--

Best,

Raghav

Asteroid

Thanks so much!  I needed to read in single Parquet files stored locally (not loaded into HDFS).  Though that task is much simpler than the ones you show above, I didn't have a clue!  Using the Python tool and this command, it worked!

 

df = pd.read_parquet('filename', engine='pyarrow')

Alteryx.write(df, 1)

 

Yay!
