Guide to Cross-Validation

DiganP
Alteryx Alumni (Retired)
Created

In Alteryx, there are 5 customizable options within the Cross-validation screen:

 

  1. Number of folds
  2. Number of trials
  3. Enter positive class for target variable
  4. Use stratified cross-validation
  5. Set seed

Validation Pic 1.png

  1. Number of folds

This option randomly splits your data into equal-sized samples, or folds (5 equal-sized samples would be generated in the example below). With 5 folds, each iteration follows an 80/20 split: 4 of the samples are used as training data and the remaining sample is used as validation data. This process is repeated until each sample has been used as validation data exactly once (5 times in the example below). For example, let’s label the samples A, B, C, D, and E. In the first iteration, B, C, D, and E are used as training data and A is used as validation data. In the second iteration, A, C, D, and E are used as training data and B is used as validation data, and so on. A higher number of folds will result in more robust estimates of model quality but will take longer to run.

Validation Pic 2.png
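As a rough illustration of standard k-fold splitting, where each fold serves once as validation data while the remaining folds form the training data, here is a minimal Python sketch. The function name `k_fold_splits` and the 100-row dataset are assumptions made for this example; this is not the code the Alteryx tool runs internally.

```python
import random

def k_fold_splits(n_rows, n_folds, seed=None):
    """Yield (train_rows, valid_rows) index lists for plain k-fold cross-validation."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    # Deal the shuffled rows round-robin so fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, valid in enumerate(folds):
        # Every fold except fold i goes into the training set.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, valid

splits = list(k_fold_splits(n_rows=100, n_folds=5, seed=1))
for train, valid in splits:
    print(len(train), len(valid))  # prints "80 20" for each of the 5 iterations
```

Each row lands in the validation set exactly once across the 5 iterations, which is what makes the averaged score an estimate over the whole dataset.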

  2. Number of trials

This option sets the number of times the whole cross-validation procedure is repeated, which guards against a first random split that happens to produce skewed folds. The folds are selected differently in each trial, and the overall results are averaged across all the trials. For example, in the screen shown below, the cross-validation procedure would be repeated 3 times.

Validation Pic 3.png
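The idea of repeating the procedure and averaging can be sketched in Python as follows. Everything here is illustrative: `repeated_cv`, `fold_indices`, and the stand-in "model" `majority_class_accuracy` (which always predicts the most common training label) are hypothetical names, and the 60/40 label list is made-up data.

```python
import random

def fold_indices(n_rows, n_folds, rng):
    """Shuffle row indices and deal them into n_folds folds."""
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def majority_class_accuracy(labels, train, valid):
    """Hypothetical stand-in model: always predict the most common training label."""
    train_labels = [labels[i] for i in train]
    guess = max(sorted(set(train_labels)), key=train_labels.count)
    return sum(labels[i] == guess for i in valid) / len(valid)

def repeated_cv(labels, n_folds, n_trials, seed=None):
    """Run k-fold cross-validation n_trials times and average the trial scores."""
    rng = random.Random(seed)
    trial_scores = []
    for _ in range(n_trials):
        folds = fold_indices(len(labels), n_folds, rng)  # fresh random split per trial
        fold_scores = []
        for i, valid in enumerate(folds):
            train = [r for j, f in enumerate(folds) if j != i for r in f]
            fold_scores.append(majority_class_accuracy(labels, train, valid))
        trial_scores.append(sum(fold_scores) / n_folds)
    return trial_scores, sum(trial_scores) / n_trials

labels = ["Yes"] * 60 + ["No"] * 40
per_trial, overall = repeated_cv(labels, n_folds=5, n_trials=3, seed=1)
print(per_trial, overall)
```

With a real model, the per-trial scores would usually differ slightly, and the spread between them gives a feel for how sensitive the estimate is to the particular split.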

 

  3. Enter positive class for target variable

Some of the measures reported, such as the F1 score, require a distinction between a positive class (such as “Yes” or 1) and a negative class (such as “No” or 0). This configuration option is not required; if it is left blank, the tool will choose which of the classes is positive.

Validation Pic 4.png
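To see why the choice of positive class matters, here is a small Python sketch that computes the F1 score from its usual definition (the harmonic mean of precision and recall). The helper name `f1_score` and the example labels are made up for illustration.

```python
def f1_score(actual, predicted, positive):
    """F1 = harmonic mean of precision and recall for the chosen positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

actual    = ["Yes", "Yes", "Yes", "No",  "No", "No", "No", "No"]
predicted = ["Yes", "Yes", "No",  "Yes", "No", "No", "No", "No"]

f1_yes = f1_score(actual, predicted, positive="Yes")
f1_no = f1_score(actual, predicted, positive="No")
print(round(f1_yes, 3))  # 0.667
print(round(f1_no, 3))   # 0.8
```

The same predictions give two different F1 scores depending on which class is treated as positive, so it is worth setting this option explicitly when one class is the one you care about.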

 

 

  4. Use stratified cross-validation

Stratification is the process of rearranging the data so that each fold is representative of the whole dataset. It is usually recommended when the target variable is imbalanced. For example, consider the class counts in the table below:

  

Yes    No
60     40

 

Say we use 5 folds (as in the example above). With 100 records in total, each fold holds 20 records, and we would want each fold to preserve the overall proportions of Yes (60%) and No (40%). This would mean that each fold contains 12 “Yes” values (60/100*20 = 12) and 8 “No” values (40/100*20 = 8). This ensures that every fold has the same class balance as the full dataset.

Validation Pic 5.png 
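One simple way to build stratified folds is to shuffle the rows of each class separately and deal them out to the folds in turn. The Python sketch below does this for the 60 Yes / 40 No example; `stratified_folds` is a hypothetical helper written for this illustration, not the tool's actual implementation.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_folds, seed=None):
    """Assign row indices to folds so each fold keeps the overall class proportions."""
    rng = random.Random(seed)
    rows_by_class = defaultdict(list)
    for row, label in enumerate(labels):
        rows_by_class[label].append(row)
    folds = [[] for _ in range(n_folds)]
    for rows in rows_by_class.values():
        rng.shuffle(rows)
        # Deal this class's rows round-robin across the folds.
        for position, row in enumerate(rows):
            folds[position % n_folds].append(row)
    return folds

labels = ["Yes"] * 60 + ["No"] * 40
folds = stratified_folds(labels, n_folds=5, seed=1)
for fold in folds:
    print(Counter(labels[row] for row in fold))  # each fold: 12 Yes, 8 No
```

When a class count does not divide evenly by the number of folds, this simple dealing scheme leaves some folds one row larger for that class, which is usually acceptable.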

  5. Set seed

Setting a seed fixes the random number generator used to assign records to folds, so the same folds, and therefore the same results, can be reproduced in later runs or in another workflow. Changing the seed will change the composition of the folds, and if this option is not selected, a different sample will be generated each time the workflow is executed.

Validation Pic 6.png
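The effect of a seed can be shown with Python's standard `random` module. This is only a sketch of the general idea; `assign_folds` is a hypothetical helper, and the seed values are arbitrary.

```python
import random

def assign_folds(n_rows, n_folds, seed=None):
    """Randomly assign row indices to folds; seed=None gives a fresh split each run."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return [sorted(indices[i::n_folds]) for i in range(n_folds)]

first_run = assign_folds(20, 5, seed=42)
second_run = assign_folds(20, 5, seed=42)
other_seed = assign_folds(20, 5, seed=7)

print(first_run == second_run)  # True: same seed, same folds
print(first_run == other_seed)  # a different seed generally produces different folds
```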

Comments
data2
6 - Meteoroid

Hi, thanks for your post. I'm confused about what the Number of folds option exactly means. If we split the data into 5 folds, shouldn't A be the validation set in the first iteration, with B, C, D, and E grouped together as the training set? Then, in the second iteration, B is the validation set and A, C, D, and E are the training set, and so on (k-fold cross-validation). My understanding is that this option is k-fold cross-validation, but your explanation suggests otherwise. Can you help me understand what it really means?

DawnDuong
13 - Pulsar

Thank you, detailed and helpful post!