Alteryx Designer Desktop Knowledge Base

SydneyF · ‎11-27-2018

You may have noticed a setting in a few of the Alteryx tools that refers to setting a “Seed” or “Random Seed”.

This option shows up in tools that include a stochastic (i.e., randomized) process. These tools include the Boosted Model, tools that include a Cross-Validation component (Decision Tree, Linear Regression, and Logistic Regression), the Simulation Sampling and Simulation Scoring tools, and the Create Samples and Random % Sample tools. Stochastic processes are also referred to as random functions because they can be interpreted as a randomized element in the overall mathematical function. The sampling tools (both standard and simulation) randomly create sub-samples of data, the cross-validation method incorporated in the predictive tools includes a randomized sub-sampling routine, and stochastic gradient boosting uses random sub-samples of data to construct the model.
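As a rough illustration of the randomized sub-sampling inside cross-validation, the Python sketch below shuffles row indices and deals them into folds. The function name `assign_folds` and its parameters are hypothetical, and this is not the code the Alteryx tools run; it only shows where the randomness enters.

```python
import random


def assign_folds(n_rows, k, seed):
    """Randomly split row indices 0..n_rows-1 into k folds (hypothetical helper)."""
    rng = random.Random(seed)      # private generator, so the result depends only on the seed
    indices = list(range(n_rows))
    rng.shuffle(indices)           # the stochastic step: a random ordering of the rows
    # Deal the shuffled indices round-robin into k folds of near-equal size.
    return [indices[i::k] for i in range(k)]


folds = assign_folds(n_rows=10, k=3, seed=1)
for number, fold in enumerate(folds, start=1):
    print(f"fold {number}: {sorted(fold)}")
```

Every row lands in exactly one fold, but which fold it lands in depends on the shuffle.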

The randomized component of these tools is not truly random. The “random” starting point or numbers are produced with a pseudorandom number generator (also known as a deterministic random bit generator). A pseudorandom number generator is a deterministic algorithm that produces numbers that approximate the properties of sequences of random numbers. That means that although the numbers seem random to us, they are actually generated by a deterministic algorithm that creates number sequences that (only) look random. Truly random numbers can be generated using a hardware random number generator; however, pseudorandom number generators are important because they can generate “random” numbers quickly, and those numbers are reproducible. Using a pseudorandom number generator makes it possible to replicate your results and subsample groups.
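To make "deterministic algorithm" concrete, here is a minimal Python sketch of one of the simplest pseudorandom number generators, a linear congruential generator. The constants are the widely published "Numerical Recipes" parameters and the function name `lcg` is illustrative. This is not the generator that Alteryx or R uses (R's default is the Mersenne Twister); it is only meant to show that each "random" number is computed from the previous one.

```python
def lcg(seed, count):
    """Return `count` pseudorandom floats in [0, 1) from a linear congruential generator."""
    a, c, m = 1664525, 1013904223, 2**32   # "Numerical Recipes" constants
    state = seed
    values = []
    for _ in range(count):
        state = (a * state + c) % m        # next state is fully determined by the current one
        values.append(state / m)           # scale the integer state to [0, 1)
    return values


first_run = lcg(seed=42, count=5)
second_run = lcg(seed=42, count=5)
print(first_run)
print(first_run == second_run)             # same starting state, same sequence
print(lcg(seed=43, count=5) == first_run)  # different starting state, different sequence
```

Nothing in the loop is random: given the same starting state, the arithmetic always produces the same values.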

The seed of a pseudorandom number generator is just the number (or vector) that is used to initialize the "random" number sequence. This means that a given number, used as a seed, will always result in the same tool outcome (e.g., data subsample), while a different seed value will result in a different outcome. The value used for the seed itself does not need to be random in most use cases.
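The same behavior can be sketched with Python's standard `random` module. This is an analogy only, and the record IDs and seed values are arbitrary.

```python
import random

rows = list(range(1, 101))   # stand-in for 100 record IDs


def subsample(seed, size=5):
    """Draw `size` rows using a generator initialized with `seed`."""
    return random.Random(seed).sample(rows, size)


print(subsample(seed=1))   # some subsample
print(subsample(seed=1))   # the identical subsample, because the seed is the same
print(subsample(seed=2))   # a different seed, so a different subsample
```

The seeds 1 and 2 are not themselves random, which matches the point above: the seed only has to be a fixed, known value.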

Many of the tools with Seed arguments (particularly the Prescriptive and Predictive tools) are written in the R programming language. For these R-based tools, the seed arguments correspond to the set.seed() function in R. The seed value you enter in the tool's Alteryx configuration is passed to the R code, where it initializes the random number generator.

To see pseudorandom number generation in action, try it out for yourself! Set up an input data set and connect it to a Random % Sample tool. Notice that when you check the Deterministic Output option, setting a seed makes the random sample reproducible for that seed value, while a different seed results in a different subset. Unchecking the Deterministic Output option results in a subsample that changes from run to run and is not easily or consistently reproducible.
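If you do not have Designer open, the experiment can be approximated in a few lines of Python. The function `random_percent_sample` and its `deterministic` and `seed` parameters are hypothetical stand-ins that mimic the options described above; they are not the tool's actual implementation, and the real tool may select records differently.

```python
import random


def random_percent_sample(records, percent, deterministic=False, seed=1):
    """Return roughly `percent`% of records (hypothetical stand-in for the tool)."""
    # With deterministic output the generator starts from the given seed;
    # otherwise Python seeds it from an OS entropy source or the clock,
    # so each run starts somewhere different.
    rng = random.Random(seed) if deterministic else random.Random()
    size = round(len(records) * percent / 100)
    return sorted(rng.sample(records, size))


records = list(range(1, 21))
print(random_percent_sample(records, 25, deterministic=True, seed=7))
print(random_percent_sample(records, 25, deterministic=True, seed=7))  # repeats the line above
print(random_percent_sample(records, 25))                              # varies from run to run
```

Running the script several times should show the first two lines staying fixed while the third changes.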

wenjuanchen · ‎11-28-2018

I asked the question on "Set Seed" in a post, then you guys wrote an article. Amazing!

Thank you so much!

What is the Set Seed argument, and Why is it There?