Welcome back to another exciting edition of "Will it Alteryx?" In this installment I'll be looking at Parquet, a columnar storage format primarily used within the Hadoop ecosystem. While Parquet is growing in popularity and seeing use outside of Hadoop, it is most commonly used for column-oriented storage of files within HDFS, and sometimes as the storage format for Hive tables.
Interest in Parquet has rapidly surpassed both ORC and Avro formats.
A column-oriented storage format organizes a table by column rather than by row. This makes querying much more efficient for applications that need specific columns rather than entire records, and it also enables more effective encoding and compression. As an example, I took a 2 MB CSV file and converted it to a Parquet file that was almost 40% smaller.
Parquet storage format typically provides significant savings in file sizes.
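If you want to try this comparison on your own data, here is a minimal sketch using pandas (with pyarrow installed as the Parquet engine); the file names are placeholders, and your savings will vary with the data and the compression codec:

import os
import pandas as pd

# Convert a CSV file to Parquet (pandas compresses with Snappy by default)
df = pd.read_csv('mydata.csv')
df.to_parquet('mydata.parquet')

# Compare the on-disk sizes
print('CSV bytes:', os.path.getsize('mydata.csv'))
print('Parquet bytes:', os.path.getsize('mydata.parquet'))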
As more and more organizations move to the cloud, reducing file sizes delivers an immediate saving on storage costs. But you may be wondering: how can we work with Parquet files in Hadoop from Alteryx when the HDFS input only supports CSV and Avro file types?
The HDFS File Selection tool only allows Avro or CSV file types.
There are several ways to process Parquet data with Alteryx; what follows is not an exhaustive list, just a few of the options.
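If your workflow already runs Spark code against the cluster (for example through Alteryx's Apache Spark Code tool), reading Parquet is a one-liner in PySpark: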
# Read a Parquet file from HDFS into a Spark DataFrame
df = sqlContext.read.parquet("/hdfs_path/file.parquet")
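Another option is the Python tool with the pyarrow library, which can talk to HDFS directly. A minimal sketch follows; note that pa.hdfs.connect requires the Hadoop client libraries (libhdfs) on the machine running the workflow, and host, port, and user are placeholders for your NameNode's hostname, RPC port, and a user with read access: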
Package.installPackages(['pyarrow'])      # install pyarrow in the Python tool
import pyarrow as pa
import pyarrow.parquet as pq

hdfs = pa.hdfs.connect(host, port, user)  # placeholder connection settings
df = pq.read_table(hdfs.open('/mydir/myfile.parquet', 'rb')).to_pandas()
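The approach I'll demonstrate here avoids any client-side Hadoop dependencies: use the Python tool to download the file over the WebHDFS REST API with the wget package, then read it locally with pandas and pass the result to the workflow: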
from ayx import Package
from ayx import Alteryx
Package.installPackages(['wget', 'pyarrow'])  # pandas needs a Parquet engine such as pyarrow

import pandas as pd
import wget

host = 'hdfs.namenode.com'
port = '9870'                        # WebHDFS port
hdfs_path = '/mydir/myfile.parquet'  # the HDFS file path

# Build the WebHDFS URL and download the file locally
url = 'http://' + host + ':' + port + '/webhdfs/v1' + hdfs_path + '?op=OPEN'
local_file = wget.download(url)

# Read the downloaded Parquet file and pass it to the workflow as output 1
df = pd.read_parquet(local_file)
Alteryx.write(df, 1)
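Two notes on the code above: port 9870 is the default WebHDFS port on Hadoop 3.x clusters (Hadoop 2.x typically uses 50070), and pandas.read_parquet needs a Parquet engine installed, which is why pyarrow is included in the installPackages call.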
Hopefully this gives you some ideas on how you can use Alteryx to process parquet data in your organization. If you have other ideas on how this can be accomplished please add them to the comments below.
If you have any technologies you would like to see explored in future installments of the "Will it Alteryx" series, please leave a comment below!
David works as a Solutions Architect helping customers understand the Alteryx platform, how it integrates with their existing IT infrastructure, and how Alteryx can provide high performance and advanced analytics. He's passionate about learning new technologies and recognizing how they can be leveraged to solve organizations' business problems.