Hi everyone!
We're currently looking at hundreds of files and trying to figure out which column is most likely the primary key.
I have created a simple workflow that reads/writes .CSV files from/to HDFS. Step by step, I am:
- Summarizing all the columns from the input to count unique values ("Count distinct")
- Transposing to obtain 2 columns (column_name + count_Distinct)
- Appending the total number of rows for each and calculating the % of unique values
- Sorting by % descending, selecting relevant columns and outputting the result.
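To make the per-file logic concrete, here is a rough sketch of the same steps in plain Python (stdlib only; the function name and sample data are just illustrations, not part of the actual Alteryx workflow): count distinct values per column, divide by the row count, and sort descending so the most PK-like column comes first.

```python
import csv
import io

def pk_candidates(csv_text):
    """Rank columns by % of unique values: the higher the ratio,
    the more likely the column is a primary key (100% = all distinct)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total = len(rows)
    if total == 0:
        return []
    scores = []
    for col in rows[0].keys():
        # "Count distinct" per column
        distinct = len({r[col] for r in rows})
        scores.append((col, distinct, 100.0 * distinct / total))
    # sort by % of unique values, descending
    return sorted(scores, key=lambda t: t[2], reverse=True)

sample = "id,country\n1,FR\n2,FR\n3,BE\n"
# "id" is 3 distinct out of 3 rows (100%), "country" is 2 out of 3
print(pk_candidates(sample))
```

A column at exactly 100% is a PK candidate; anything below has duplicates and can only be part of a composite key at best.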
This works well for a single file, but it would be amazing if we could automate the process for our ~300 table files (with different schemas/sizes):
- For the input, I'm thinking of a batch macro for all files and Auto configure by name to adapt for the schemas
- For the output, I'm thinking about taking the name of the .CSV from a field (1 output per table name)
- My issue is dynamically updating the "Summarize" tool, which is crucial for the workflow. The tool would need to "Select all columns" and apply "Count Distinct" to every one of them, no matter how many columns the table has. I thought about modifying the inner XML with an Action tool, but I'm unsure whether specifying "select all + count distinct" is doable that way.
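For comparison only (not a replacement for the macro approach), the batching idea above can also be sketched in plain Python: loop over every .CSV in a directory, adapt to each file's schema automatically via its header, and write one summary output named after the source table. All file/function names here are hypothetical.

```python
import csv
import glob
import os

def summarize_file(path):
    """Per-column (name, count_distinct, % unique), sorted descending."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))  # schema comes from the header row
    total = len(rows)
    if total == 0:
        return []
    result = [
        (col, len({r[col] for r in rows}), 100.0 * len({r[col] for r in rows}) / total)
        for col in rows[0].keys()
    ]
    return sorted(result, key=lambda t: t[2], reverse=True)

def batch(in_dir, out_dir):
    """One summary .CSV per input table, named after the source file."""
    os.makedirs(out_dir, exist_ok=True)
    for path in glob.glob(os.path.join(in_dir, "*.csv")):
        table = os.path.splitext(os.path.basename(path))[0]
        out_path = os.path.join(out_dir, f"{table}_pk_summary.csv")
        with open(out_path, "w", newline="") as f:
            w = csv.writer(f)
            w.writerow(["column_name", "count_distinct", "pct_unique"])
            w.writerows(summarize_file(path))
```

Because the column list is read from each file's header, nothing needs to be reconfigured per schema, which is essentially what "Auto configure by name" plus a fully dynamic Summarize would achieve inside the workflow.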
Thanks in advance!