Hi all experts,
Current i face problem when i try to use output file or IN-DB, it is very slow. 30MB size csv import to hive database it tooks 2 and half hours to finish. Why is it happen? what factor decide the speed of output data from alteryx to hive database? HDFS? Internal network issues? ODBC config? More details: Im using lastest Simba ODBC to connect and get the data from UAT mysql database server, input data by input tools from mysql and local to browser data, it is fast and takes no more than 10 secound. In-db tools i tried also, read is fast and wirte is super slow. (no different with output tools). With this super slow speed of write data,totally unacceptable speed on real life. At the beginning i think is hdfs or hive issues so I tried to move file to HDFS and copyFromLocal write to hive db used only around 5 secound to finish mapreduce.
Please help and i can provide more information if you need, thank you!
More inforamtion to let you guys:
FYI,HDFS cluter and hive is using Apache Ambari open source management platform
This picture is how Alteryx get data on testing environment.
Alteryx Output data setting:
Window Server ODBC config: |
There are a whole slew of reasons why this could be writing slow (including internal one size fits all sql generation from Alteryx)... two things I'd try - 1) add a sample tool prior to writing with a limit of 10X the files in your .csv - this tells Alteryx and HIVE to expect a certain number of records instead of allowing an infinite write statement. 2) Try Datastream In as your write tool. I only use HIVE in conjunction with Spark/Databricks - so the writing there is through the Spark driver but I'm not seeing these issues.
Hi apathetichell,
Thank you for your reply,
I made some changing after read alteryx ODBC document
Simba ODBC -> selected
Manage in-DB connections
write section
After that 150 MB data by using datastream and changed config. it taks 32 mins to finish. I see that's big improvement, but still want to know how to insert into hive more faster. The speed still not good enough, cant even finish the test case in UAT environment. And i want to know is it becaues the HDFS make hive insert from alteryx so slow? or config not setting well?
i tried read data from hive, 30 second read 150MB data from hive. That's so wired.
Looking forward for your reply and hope other guys with same experience before can give the suggestion too thanks.
