In Alteryx, there are 5 customizable options within the Cross-validation screen:
Number of folds: This option randomly splits your data into equal-sized samples, or folds (5 folds in the example below). In each iteration, one fold is held out as validation data and the remaining folds are used as training data (1 validation fold and 4 training folds here, an 80/20 split). The process is repeated so that each fold is used as validation data exactly once (5 iterations in this example). For example, label the folds A, B, C, D, E. In the first iteration, A is used as validation data and B, C, D, E together form the training data. In the second iteration, B is the validation data and A, C, D, E are the training data, and so on. A higher number of folds will result in more robust estimates of model quality but will take longer to run.
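To make the mechanics concrete, here is a minimal sketch in Python using scikit-learn's KFold. The Alteryx tool is configured entirely through its GUI, so this is an illustration of the concept rather than what the tool actually runs:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(50, 2)  # 50 toy records with 2 features

kf = KFold(n_splits=5, shuffle=True, random_state=1)
for i, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # Each iteration holds out one fold (10 records) as validation data
    # and uses the remaining four folds (40 records) as training data.
    print(f"Iteration {i}: {len(train_idx)} training rows, {len(val_idx)} validation rows")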
Number of trials: This option lets you choose how many times the entire cross-validation procedure should be repeated, in case a single random split happens to produce skewed folds. The folds are drawn differently in each trial, and the overall results are averaged across all trials. For example, setting the number of trials to 3 repeats the cross-validation procedure three times.
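This repeated procedure corresponds to what scikit-learn calls RepeatedKFold; the sketch below (again illustrative, using a toy dataset and model rather than anything from the tool) averages accuracy across 3 trials of 5-fold cross-validation:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_classification(n_samples=100, random_state=1)  # toy dataset

# 3 trials x 5 folds = 15 train/validate runs; folds are drawn differently per trial.
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=rkf)
print(f"Mean accuracy across all {len(scores)} runs: {scores.mean():.3f}")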
Positive class: Some of the measures reported, such as the F1 score, require a distinction between a positive class (such as “Yes” or 1) and a negative class (such as “No” or 0). This option is not required; if it is left blank, the tool will choose which class to treat as positive.
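A small sketch of why this choice matters: the same set of predictions yields a different F1 score depending on which label is treated as positive (the labels and predictions here are made up for illustration):

from sklearn.metrics import f1_score

y_true = ["Yes", "Yes", "No", "Yes", "No"]
y_pred = ["Yes", "No", "No", "Yes", "Yes"]

print(f1_score(y_true, y_pred, pos_label="Yes"))  # 0.667 with "Yes" as the positive class
print(f1_score(y_true, y_pred, pos_label="No"))   # 0.500 with "No" as the positive class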
Stratified cross-validation: Stratification is the process of arranging the data so that each fold is a good representative of the whole dataset. It is usually recommended when the target variable is imbalanced. For example, consider the target distribution in the table below:

Yes | No
60 | 40
If we use 5 folds (as in the example above), we would want each fold to reflect the same overall proportions of Yes (60%) and No (40%). With 100 records split into 5 folds of 20 records each, every fold would contain 12 “Yes” values (60/100 * 20 = 12) and 8 “No” values (40/100 * 20 = 8). This ensures that each fold's class balance matches that of the full dataset.
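A minimal sketch of stratified folds in Python, using scikit-learn's StratifiedKFold and the 60/40 target distribution from the table above (illustrative only, not the tool's implementation):

import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array(["Yes"] * 60 + ["No"] * 40)  # 100 records: 60% Yes, 40% No
X = np.zeros((100, 1))                    # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for i, (_, val_idx) in enumerate(skf.split(X, y), start=1):
    fold = y[val_idx]
    print(f"Fold {i}: {np.sum(fold == 'Yes')} Yes, {np.sum(fold == 'No')} No")
# Every fold contains 12 "Yes" and 8 "No" records, matching the overall proportions.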
Set seed: Setting a seed fixes the random assignment of records to folds, which lets you reproduce the same results in another workflow. Changing the seed changes the composition of the folds, and if this option is not selected, a different random split is generated each time the workflow is executed.
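A quick sketch of the reproducibility idea, with scikit-learn's random_state standing in for the tool's seed option:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy records

folds_a = [val.tolist() for _, val in KFold(5, shuffle=True, random_state=42).split(X)]
folds_b = [val.tolist() for _, val in KFold(5, shuffle=True, random_state=42).split(X)]
folds_c = [val.tolist() for _, val in KFold(5, shuffle=True, random_state=7).split(X)]

print(folds_a == folds_b)  # True: the same seed reproduces the same folds
print(folds_a == folds_c)  # False: a different seed yields different folds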