Hi everyone!
We're currently looking at hundreds of files and trying to figure out which column in each one is most likely the Primary Key.
I have created a simple workflow which will read/write .CSV files from/to HDFS. Step by step, I am:
This works well for a single file, but it would be amazing if we could automate the process for our roughly 300 files/tables (with different schemas and sizes):
Thanks in advance!
Hi @YULteryx ,
Instead of using the Summarize tool to Count Distinct, could you:
This should replicate the Count Distinct but in a more dynamic way for your Batch Macro.
Hope this helps.
Luke
Hi,
There may be a performance reason not to do this, but have you tried transposing the data first?
Then you can Group By the Name field and take a Count of Value and a Count Distinct of Value at the same time, which will let you perform your calculations from there.
You might need to filter out NULL values as well.
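For anyone who wants to sanity-check the logic outside Alteryx, here is a rough Python sketch of the same idea: transpose each row into Name/Value pairs, drop NULLs, then group by Name and compare Count with Count Distinct. The function names (`profile_columns`, `candidate_keys`) and the rule "a column is a candidate key when every row has a non-null, unique value" are my own assumptions for illustration, not anything built into Alteryx.

```python
from collections import defaultdict


def profile_columns(rows):
    """Profile each column of a table given as a list of dicts.

    Hypothetical helper: mimics Transpose -> Filter NULLs ->
    Summarize (Group By Name, Count, Count Distinct).
    """
    total_rows = len(rows)
    # Collect column names in first-seen order, so all-NULL columns still appear.
    names = list(dict.fromkeys(name for row in rows for name in row))

    counts = defaultdict(int)      # non-null values per column
    distinct = defaultdict(set)    # distinct non-null values per column

    for row in rows:
        for name, value in row.items():   # the "transpose" step
            if value is None or value == "":
                continue                   # filter out NULL / empty values
            counts[name] += 1
            distinct[name].add(value)

    profile = {}
    for name in names:
        n = counts[name]
        d = len(distinct[name])
        profile[name] = {
            "count": n,
            "count_distinct": d,
            # Assumed rule: unique and non-null on every row.
            "candidate_key": total_rows > 0 and n == total_rows and d == total_rows,
        }
    return profile


def candidate_keys(rows):
    """Return the names of columns that look like a Primary Key."""
    return [name for name, stats in profile_columns(rows).items()
            if stats["candidate_key"]]


# Made-up sample data, only to show the shape of the input.
sample = [
    {"id": "1", "city": "Montreal", "code": "A"},
    {"id": "2", "city": "Montreal", "code": ""},
    {"id": "3", "city": "Quebec", "code": "B"},
]
print(candidate_keys(sample))
```

Because the columns are discovered from the data rather than hard-coded, the same function can be run over files with different schemas, which is the point of transposing first.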
Love the idea of finding the Primary Key this way, by the way!
Thank you both for your answers! It clearly shows how Alteryx offers several ways to obtain the same output.
I will try to implement your approach and let you know how it goes.
Cheers.
[EDIT] - it works perfectly. Appreciate your support!