Hi everyone!
We're currently looking at hundreds of files and trying to figure out which column is most likely the primary key.
I have created a simple workflow that reads/writes .CSV files from/to HDFS. Step by step, I am:
- Summarizing all the columns from the input to count unique values ("Count distinct")
- Transposing to obtain 2 columns (column_name + count_Distinct)
- Appending the total number of rows for each and calculating the % of unique values
- Sorting by % descending, selecting relevant columns and outputting the result.
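To make the per-file logic concrete, here is a rough sketch of the same steps in plain Python (stdlib only; the function name and sample data are just illustrations, not part of the actual Alteryx workflow): count distinct values per column, divide by the row count, and sort descending so the most PK-like column comes first.

```python
import csv
import io

def pk_candidates(csv_text):
    """Rank columns by % of unique values: the higher the ratio,
    the more likely the column is a primary key (100% = all distinct)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total = len(rows)
    if total == 0:
        return []
    scores = []
    for col in rows[0].keys():
        # "Count distinct" per column
        distinct = len({r[col] for r in rows})
        scores.append((col, distinct, 100.0 * distinct / total))
    # sort by % of unique values, descending
    return sorted(scores, key=lambda t: t[2], reverse=True)

sample = "id,country\n1,FR\n2,FR\n3,BE\n"
# "id" is 3 distinct out of 3 rows (100%), "country" is 2 out of 3
print(pk_candidates(sample))
```

A column at exactly 100% is a PK candidate; anything below has duplicates and can only be part of a composite key at best.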
This works well for a single file, but it would be amazing if we could automate the process for our ~300 table files (with different schemas/sizes):
- For the input, I'm thinking of a batch macro for all files and Auto configure by name to adapt for the schemas
- For the output, I'm thinking about taking the name of the .CSV from a field (1 output per table name)
- My issue is dynamically updating the "Summarize" tool, which is crucial for the workflow. The tool would need to "Select all columns" and apply "Count Distinct" to every one of them, no matter how many columns the table has. I thought about modifying the inner XML with an Action tool, but I'm unsure whether specifying "select all + count distinct" is doable that way.
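For comparison only (not a replacement for the macro approach), the batching idea above can also be sketched in plain Python: loop over every .CSV in a directory, adapt to each file's schema automatically via its header, and write one summary output named after the source table. All file/function names here are hypothetical.

```python
import csv
import glob
import os

def summarize_file(path):
    """Per-column (name, count_distinct, % unique), sorted descending."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))  # schema comes from the header row
    total = len(rows)
    if total == 0:
        return []
    result = [
        (col, len({r[col] for r in rows}), 100.0 * len({r[col] for r in rows}) / total)
        for col in rows[0].keys()
    ]
    return sorted(result, key=lambda t: t[2], reverse=True)

def batch(in_dir, out_dir):
    """One summary .CSV per input table, named after the source file."""
    os.makedirs(out_dir, exist_ok=True)
    for path in glob.glob(os.path.join(in_dir, "*.csv")):
        table = os.path.splitext(os.path.basename(path))[0]
        out_path = os.path.join(out_dir, f"{table}_pk_summary.csv")
        with open(out_path, "w", newline="") as f:
            w = csv.writer(f)
            w.writerow(["column_name", "count_distinct", "pct_unique"])
            w.writerows(summarize_file(path))
```

Because the column list is read from each file's header, nothing needs to be reconfigured per schema, which is essentially what "Auto configure by name" plus a fully dynamic Summarize would achieve inside the workflow.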
Thanks in advance!