
Alteryx Designer Desktop Discussions


Alteryx First N% Rows (e.g. 80%) coming up as "undefined 80%"

LinoLam99
5 - Atom

I am trying to prep my data by splitting it into training and test sets (an 80/20 split). I selected the "First N% of rows" option and entered N=80, but the tool's label on my workflow diagram shows "undefined 80%". I have been trying to find out what the issue is, without success. Please help.

phottovy
13 - Pulsar

Hi @LinoLam99,

You are getting this error because you have every column selected in the group by section. The tool then treats each row in your dataset as its own "group" and doesn't know how to take 80% of a single row. Only use group by if you want a given percentage within each value of certain columns. In your example, say you wanted to keep 80% of each "Customer Segment"; then you would check that column to group by. If you only want 80% of your total data, deselecting all of the columns will give you what you are looking for.

You might also want to look at the "Create Samples" tool. With the "Sample" tool, the other 20% gets dropped, while the "Create Samples" tool has outputs for both the train and test sets.
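
To make the group by point concrete, here is a rough Python sketch of "first N% of rows, per group". It is not how Alteryx implements the Sample tool; the function names, the sample data, and the choice to truncate the row count are all assumptions made for illustration.

```python
from collections import defaultdict

def group_rows(rows, group_by):
    """Bucket rows by the values of the group_by columns, preserving row order."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in group_by)
        groups[key].append(row)
    return groups

def first_n_percent(rows, pct, group_by=()):
    """Keep the first pct% of rows, per group when group_by is given.

    Truncating the per-group count is an assumption for this sketch,
    not Alteryx's documented rounding rule.
    """
    kept = []
    for members in group_rows(rows, group_by).values():
        kept.extend(members[: len(members) * pct // 100])
    return kept

# Hypothetical data: 10 rows, 5 per segment.
rows = [{"id": i, "segment": "Consumer" if i % 2 else "Corporate"} for i in range(10)]

no_grouping = first_n_percent(rows, 80)                       # one group of 10 rows
by_segment = first_n_percent(rows, 80, ("segment",))          # two groups of 5 rows
by_everything = first_n_percent(rows, 80, ("id", "segment"))  # ten groups of 1 row
print(len(no_grouping), len(by_segment), len(by_everything))
```

With every column checked, each group holds a single row, so "80% of the group" has no sensible answer, which matches the situation described above.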

apathetichell
20 - Arcturus

Also, use the Create Samples tool. It will split your data into test and train streams automatically.

LinoLam99
5 - Atom

Hi @phottovy,

Thank you for such a quick reply. I have used the Create Samples tool instead and it seems to have done the job. I can see that the Create Samples tool has three outputs: estimation (train), validation (test), and holdout. I was wondering what the holdout output is for?

phottovy
13 - Pulsar

@apathetichell I edited my post to recommend the "Create Samples" tool a couple of minutes after posting my initial response, but good catch on what my explanation left out.

apathetichell
20 - Arcturus

The holdout output contains records that are in neither the testing nor the training data. It is used on larger datasets. Worth mentioning that your original strategy would not have worked. If you select 80% and then 20% for training and testing respectively, both selections need to come from the same seed. Otherwise about 80% of your second selection (16 of every 20 records) should, statistically speaking, repeat records already in your first selection. That's not what you want. The Sample tool that you used doesn't allow seed setting, so it won't give you what you want. Random % Sample COULD, with a Join, be used to divide up a set, but I'd recommend just using Create Samples or, as I do, hard-coding it in R.
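
The overlap problem is easy to see in a small simulation. This is a Python sketch rather than an Alteryx workflow; the record ids, seeds, and function names are made up for illustration.

```python
import random

def independent_draws(ids, seed_a, seed_b):
    """Draw 80% and 20% as two unrelated random samples (the flawed approach)."""
    train = set(random.Random(seed_a).sample(ids, len(ids) * 80 // 100))
    test = set(random.Random(seed_b).sample(ids, len(ids) * 20 // 100))
    return train, test

def single_shuffle_split(ids, seed):
    """Shuffle once with one seed, then cut at 80% so the two parts are disjoint."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 80 // 100
    return set(shuffled[:cut]), set(shuffled[cut:])

ids = list(range(1000))  # hypothetical record ids

bad_train, bad_test = independent_draws(ids, seed_a=1, seed_b=2)
overlap = len(bad_train & bad_test)
print(f"independent draws: {overlap} of {len(bad_test)} test records are also in train")

good_train, good_test = single_shuffle_split(ids, seed=1)
print(f"single shuffle: {len(good_train & good_test)} shared records")
```

On average roughly four out of five test records from the independent draws also appear in the training set, while the single shuffle never shares a record between the two parts.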

danilang
19 - Altair

Hi @LinoLam99,

In any data analysis work you have two phases: 1) developing the model and 2) validating the model. When developing the model you need two distinct datasets, one to train the model and the other to test it. You iterate, developing and refining the model using the same training and testing datasets in each iteration. Once your model is as refined as required, you validate it using the holdout dataset. Using the holdout dataset as a final validation is a check against biases introduced by the process that created the three datasets. There is always a small chance that the random partitioning of your data into the three groups has not sampled evenly across all the independent variables.

 

In order to ensure that the training, testing and holdout datasets don't change throughout your development stage, you should have a separate workflow that partitions the data into three identified tables.  From then on use these three tables as the input to your development and validation workflows.
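
Outside Alteryx, the same "partition once, reuse everywhere" idea could look something like this Python sketch. The 60/20/20 percentages, the seed, the file names, and the function names are all assumptions for illustration, not settings taken from the Create Samples tool.

```python
import csv
import random
import tempfile
from pathlib import Path

NAMES = ("train", "test", "holdout")

def partition_to_tables(rows, out_dir, percents=(60, 20, 20), seed=42):
    """Shuffle once with a fixed seed and write three disjoint CSV tables.

    The holdout table receives whatever remains after the first two cuts.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    cut1 = n * percents[0] // 100
    cut2 = cut1 + n * percents[1] // 100
    parts = (shuffled[:cut1], shuffled[cut1:cut2], shuffled[cut2:])
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, part in zip(NAMES, parts):
        with open(out_dir / f"{name}.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(part)

def load_table(out_dir, name):
    """Read one saved table back; values come back as strings."""
    with open(Path(out_dir) / f"{name}.csv", newline="") as f:
        return list(csv.DictReader(f))

# Run the partition step once on hypothetical data...
rows = [{"id": i, "value": i * i} for i in range(100)]
out_dir = tempfile.mkdtemp()
partition_to_tables(rows, out_dir)

# ...then every later development or validation run only reads the saved tables.
train, test, holdout = (load_table(out_dir, name) for name in NAMES)
print(len(train), len(test), len(holdout))
```

Because the partition is written to disk once, later runs never reshuffle, so the three datasets stay fixed for the whole development stage.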

 

Dan
