Hi all,
my dataset is a Hive file with approx. 2,000 columns and several million rows. I need Trifacta to keep only some of the columns (about 500) and roughly 10% of the rows, based on a filter on the values of one column.
What is more efficient, removing useless columns first or filtering rows first?
(the column used for filtering is kept in the final output).
Thanks for your advice.
MM
Hi, Michael--
That's a wide dataset. I would drop columns first.
Keep in mind that what you see on screen in the Transformer page is a sample, and every column in the dataset is represented in it. With so many columns, the initial sample can therefore hold only a small number of rows. After you drop the unneeded columns, take another sample; it will bring back a larger number of rows.
Here's a good topic on removing data from your dataset.
https://docs.trifacta.com/display/SS/Remove+Data
You can remove a range of columns in a single step. The operative character is the tilde (~), which is how you specify a range of columns. See the second example here.
https://docs.trifacta.com/display/SS/Remove+Data#RemoveData-Dropcolumns
After you have made your dataset narrower, you can work on filtering out rows.
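To make the order concrete, here's a minimal sketch in plain Python (not Trifacta recipe syntax) of the same drop-columns-first, filter-rows-second approach. The column names and the filter threshold are hypothetical stand-ins for your actual data.

```python
# Illustration only: project down to the kept columns first,
# then filter rows on one of the retained columns.
rows = [
    {"id": 1, "score": 95, "noise_a": "x", "noise_b": "y"},
    {"id": 2, "score": 12, "noise_a": "x", "noise_b": "y"},
    {"id": 3, "score": 88, "noise_a": "x", "noise_b": "y"},
]

keep = ["id", "score"]  # stand-in for the ~500 columns you retain

# Step 1: drop the unneeded columns, narrowing each record.
narrow = [{k: r[k] for k in keep} for r in rows]

# Step 2: filter rows on the retained column (keeps ~10% in your case).
result = [r for r in narrow if r["score"] >= 80]
```

The narrowing step shrinks each record before the row filter runs, which is the same reason dropping columns first pays off in the Transformer: less data per row to carry through the rest of the recipe.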
Hope that helps.
Cheers,
-SteveO
Hello Steve,
thanks for this thorough answer. This is very useful.
Have a nice day.