Alteryx Designer Desktop Discussions

LinoLam99 · 08-13-2021

I am trying to prep my data to split it into test and training data (80/20 split). I have selected the First N% of rows function and input a value of N=80; however, on my workflow diagram it is displaying an "undefined 80%" label. I have been trying to find out what the issue is, but have not been successful. Please help.

phottovy · 08-13-2021

Hi @LinoLam99 ,

You are getting this error because you have every column selected under the Group By section. The tool will see each row in your dataset as its own "group" and doesn't know how to take 80% of a single row. You should only use Group By if you want a certain portion of each group defined by specific columns. In your example, say you wanted to keep 80% of each "Customer Segment". Then you would check that column to group by. If you only want 80% of your total data, deselecting all of the columns will give you what you are looking for.
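To make the difference between grouped and ungrouped selection concrete, here is a minimal Python sketch. It is only an analogy, not how the Sample tool is implemented: the function `first_n_percent`, the sample rows, and the choice to round down are all assumptions made for illustration.

```python
from collections import defaultdict


def first_n_percent(rows, pct, group_by=None):
    """Keep the first pct% of rows, overall or within each group.

    Rounding down is an assumption made for this sketch.
    """
    if not group_by:
        return rows[: len(rows) * pct // 100]

    # Bucket rows by the values of the group-by columns, preserving order.
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in group_by)
        groups[key].append(row)

    kept = []
    for members in groups.values():
        kept.extend(members[: len(members) * pct // 100])
    return kept


# Hypothetical data: ten rows, two customer segments of five rows each.
rows = [
    {"id": i, "segment": "Consumer" if i % 2 else "Corporate"}
    for i in range(10)
]

overall = first_n_percent(rows, 80)                                   # 80% of all rows
per_segment = first_n_percent(rows, 80, group_by=["segment"])         # 80% of each segment
every_column = first_n_percent(rows, 80, group_by=["id", "segment"])  # one row per group
```

With every column in the group-by, each group holds a single row, so there is no sensible "80% of the group" to take; in this sketch the result is simply empty.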

You might also want to look at the "Create Samples" tool. With the "Sample" tool, the other 20% gets dropped while the "Create Samples" tool has outputs for both the train and test sets.

apathetichell · 08-13-2021

also - use the create samples tool... it will split your test/train into streams automatically.

LinoLam99 · 08-13-2021

Hi @phottovy,

Thank you for such a quick reply. I have used the Create Samples tool instead and it seems to have done the job. I can see that the Create Samples tool has three outputs: estimation (train), validation (test), and holdout. I was wondering what the holdout output is for?

phottovy · 08-13-2021

@apathetichell I edited my post to recommend the "Create Samples" tool a couple of minutes after posting my initial response, but good catch on what my explanation was missing.

apathetichell · 08-13-2021

The holdout output contains records that are in neither the testing nor the training data. It is used on larger datasets. Worth mentioning that your original strategy would not have worked. If you select 80% and then 20% for training and testing respectively, they need to come from the same seed - otherwise about 80% of your second selection (16 of the 20 percentage points) would, statistically speaking, repeat records already in your first selection. That's not what you want. The Sample tool that you used doesn't allow for seed setting, so it's not going to give you what you want. Random % Sample COULD, with a join, be used to divide up a set, but I'd recommend just using Create Samples - or, as I do, hard-coding it in R.
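The overlap problem can be checked with a small Python sketch. It is not Alteryx code: the record IDs and seeds are made up, and the "shuffle once, then cut" approach is just one common way to get disjoint sets, swapped in here for illustration.

```python
import random

records = list(range(1000))  # hypothetical record IDs

# Two independent random selections: 80% for training, 20% for testing.
train = set(random.Random(1).sample(records, 800))
test = set(random.Random(2).sample(records, 200))

# On average about 80% of the test records (roughly 160 of 200)
# also appear in the training set.
overlap = len(train & test)

# One seeded shuffle followed by a single cut gives disjoint sets.
shuffled = records[:]
random.Random(42).shuffle(shuffled)
cut = len(shuffled) * 80 // 100
train_ok, test_ok = shuffled[:cut], shuffled[cut:]
```

Because `train_ok` and `test_ok` are slices of the same shuffled list, no record can land in both, and together they cover every record exactly once.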

danilang · 08-14-2021

Hi @LinoLam99

In any data analysis work, you have two phases: 1) developing the model and 2) validating the model. When developing the model you need two distinct datasets, one to train the model and the other to test it. You iterate, developing and refining the model using the same training and testing datasets in each iteration. Once your model is as refined as required, you then validate it using the holdout dataset. Using the holdout dataset as a final validation is a check against biases introduced by the process that created the three datasets. There is always a small chance that the random process of partitioning your data into the three groups has not sampled from all the independent variables evenly.

In order to ensure that the training, testing and holdout datasets don't change throughout your development stage, you should have a separate workflow that partitions the data into three identified tables. From then on use these three tables as the input to your development and validation workflows.
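A rough Python sketch of such a one-time partition step is below. The function name `partition`, its parameters, and the 60/20/20 percentages are assumptions for illustration only; the point is that a fixed seed makes the three sets reproducible, so they can be written out once and reused.

```python
import random


def partition(records, estimation_pct, validation_pct, seed):
    """Split records into estimation, validation and holdout sets.

    Whatever is left after the first two cuts becomes the holdout set.
    """
    if estimation_pct + validation_pct > 100:
        raise ValueError("percentages must not exceed 100 in total")

    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed -> same split every run

    n = len(shuffled)
    est_end = n * estimation_pct // 100
    val_end = est_end + n * validation_pct // 100
    return {
        "estimation": shuffled[:est_end],
        "validation": shuffled[est_end:val_end],
        "holdout": shuffled[val_end:],
    }


# Hypothetical usage: 100 record IDs split 60/20/20.
parts = partition(range(100), 60, 20, seed=7)
again = partition(range(100), 60, 20, seed=7)  # identical to parts
```

In practice each of the three lists would be saved as its own table, and the development and validation workflows would read those tables rather than re-running the split.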

Dan

Alteryx First N% Rows (e.g. 80%) coming up as "undefined 80%"
