I know from other people's posts on here that Alteryx doesn't natively support outputting columnar formats such as Parquet files. I was wondering if anybody has created a workaround for this for uploading to HDFS. We're using Alteryx v11.7, so ideally I'd like a solution that uses the capabilities of this version, but I'm also interested in the best method going forward as we update Alteryx. Are there any clever shenanigans that can be done using the Run Command tool?
Thanks for your help!
Hi @JonnyR
Have you voted on the idea here: https://community.alteryx.com/t5/Alteryx-Designer-Ideas/Add-Parquet-data-format-as-input-amp-output/...
And have you seen Durga's workaround here: https://community.alteryx.com/t5/Alteryx-Knowledge-Base/Access-Data-in-Parquet-Format/ta-p/80941
Thanks,
Joe
I think Durga's article has been reattributed here.
For some reason, I am getting an "Access Denied" message when I try to access that link. Do you know why that might be?
Thanks in advance.
@Mogul1 Yes, that article was archived because it was found to be inaccurate. I'm sorry to be the bearer of bad news. I had tried to do something similar and mostly succeeded, so here is part of what worked for me. Note that it is not officially supported at the moment.
To create the table:
Text in the Comment tool:
This is a test workflow to create a parquet table
To use it, follow these steps:
Make sure that you have an ODBC connection to Hive with the following settings in Server Side Properties:
hive.default.fileformat=parquet
hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
In the Input Data tool replace the connection that's there with your own.
Replace the Post-SQL statement with one appropriate to your use case:
Set the table name you want to create, and list the partition columns separately from the rest of the schema in the PARTITIONED BY clause. If the table is partitioned by more than one column, list them in order, separated by commas.
Set the Input Data tool to open any table you like; you will not actually use the data here. Alteryx fails to execute the Post-SQL statement if no data at all are returned, so the query needs to return something, but anything at all will do.
Text version of the Post-SQL statement I used:
CREATE TABLE parquet_test (storenum int, address string, city string, state string, zip string, lat double, lon double, county string, region string) PARTITIONED BY (type string) STORED AS PARQUET;
To append:
Text in Comment tool:
This is a test workflow to append to a parquet table
To use it, follow these steps:
Make sure that you have an ODBC connection to Hive with the following settings in Server Side Properties:
hive.default.fileformat=parquet
hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
Open up the macro and replace the connection that's there with your own. If you change the name of the temporary table from lltest to something else, you will also have to change the Post-SQL statement correspondingly.
Save the macro with the new data connection.
The macro should then update in the workflow.
Set the table name you wish to append to in the Text Input tool, and set the partition columns there as well. If the table is partitioned by more than one column, list them in order, separated by commas.
The partition columns must be the last fields coming into the macro, so if they are not already, use a Select tool before the macro to reorder them.
For this test, you can just select sample data you wish to append to the target table. Replace the Input Data tool with whatever you need to supply that data.
Text of the Post-SQL statement I used:
INSERT INTO parquet_test PARTITION (type) SELECT * FROM lltest;
DROP TABLE lltest;
I wish you the best of luck.