Data Science

SydneyF · ‎08-29-2019

A hyperparameter is a model parameter (i.e., component) that defines a part of the machine learning model’s architecture and influences the values of other parameters (e.g., coefficients or weights). Hyperparameters are set before training the model, whereas parameters are learned for the model during training.

Hyperparameter selection and tuning can feel like somewhat of a mystery, and setting hyperparameters can definitely feel like an arbitrary choice when getting started with machine learning. However, hyperparameters do have a significant impact on the performance of a machine learning model, and there are strategies for selecting and optimizing good hyperparameter values.

Hyperparameter tuning is an art… possibly a dark art.

I love this meme

What Exactly are Hyperparameters and Why do They Matter?

Taking a step back to think about all the inputs that go into creating a trained machine learning model, there are three broad categories: training data, parameters, and hyperparameters.

Training (input) data is the dataset that you use to train a model. The training data is what the algorithm leverages (think: instructions to build a model) to identify patterns within the data and exploit the patterns to make predictions. The training data (typically) do not become a part of your model directly; they are just used to teach your model what the data look like, and how to handle the data to make meaningful estimates of the target variable.

Parameters are learned from the training data. They are the weights, thresholds, or coefficients that allow a model to make predictions. When the algorithm "learns" the provided training data to create a model, it is really adjusting the parameters of the final model to make the best possible predictions of the target variable based on the training data. Parameters are ultimately saved as a part of the learned model. Parameters are a part of the model that are automatically customized to fit your specific data and use case.

The third component of a trained model, hyperparameters, are variables that regulate the process of training a model. Hyperparameters typically determine when the model is done being trained, or how many records an algorithm considers at a time during training, among other things.

Unlike parameters, hyperparameters are constant during the training process - they are set prior to model training and are not adjusted during the training process. Hyperparameters have a direct impact on the final values of the model’s parameters, and some hyperparameters (e.g., number of hidden layers in a neural network) also directly impact the structure of the trained machine learning model.

Because hyperparameters define the actual structure of a machine learning algorithm and the process of model training, there is not a way to “learn” these values using a loss function and training data. You can think of hyperparameters as the knobs and levers you turn and pull to make the algorithm return the clearest signal possible within the trained model.
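To make the distinction concrete, here is a minimal sketch using a tiny linear model trained by gradient descent. The function name, the toy data, and the specific values are assumptions chosen for illustration. The learning rate and the number of epochs are hyperparameters fixed before training starts, while the weight and bias are parameters the training loop adjusts.

```python
def train_linear_model(xs, ys, learning_rate, n_epochs):
    """Fit y ~ weight * x + bias by gradient descent on mean squared error."""
    weight, bias = 0.0, 0.0  # parameters: learned from the training data
    n = len(xs)
    for _ in range(n_epochs):  # hyperparameter: decides when training is done
        errors = [weight * x + bias - y for x, y in zip(xs, ys)]
        grad_weight = sum(2 * e * x for e, x in zip(errors, xs)) / n
        grad_bias = sum(2 * e for e in errors) / n
        weight -= learning_rate * grad_weight  # hyperparameter: step size
        bias -= learning_rate * grad_bias
    return weight, bias


# Toy training data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# Same data, different hyperparameters, different learned parameters
weight, bias = train_linear_model(xs, ys, learning_rate=0.05, n_epochs=2000)
undertrained = train_linear_model(xs, ys, learning_rate=0.05, n_epochs=5)

print("well trained:", round(weight, 3), round(bias, 3))
print("undertrained:", round(undertrained[0], 3), round(undertrained[1], 3))
```

Nothing inside the loop ever changes learning_rate or n_epochs, yet both shape where weight and bias end up.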

Hyperparameter Tuning

Hyperparameter optimization is usually accomplished by some automated variation of the good old “guess and check” method (called manual search or hand-tuning in the world of hyperparameter tuning when done by a human).

Hand-tuning (guess and check) entails running multiple trials of training your model, where each trial has different hyperparameter values (but the same training data – as with any good experiment, let’s try to minimize the independent variables here 😉). Each “trial” is then evaluated with a metric that you specify. Accuracy calculated with a holdout dataset is a common metric used for hyperparameter tuning. Popular automated versions of guess and check are called grid search and random search.
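A hand-tuning session might look like the sketch below. The model is a deliberately simple binned classifier invented for this example, and its single hyperparameter, n_bins, is likewise an assumption. Every trial uses the same training and holdout split, so only the hyperparameter value changes between trials.

```python
import random


def train_binned_classifier(train, n_bins, low=0.0, high=10.0):
    """Learn one majority label per equal-width bin (the model's parameters)."""
    width = (high - low) / n_bins
    counts = [[0, 0] for _ in range(n_bins)]
    for x, label in train:
        index = min(int((x - low) / width), n_bins - 1)
        counts[index][label] += 1
    bin_labels = [int(ones > zeros) for zeros, ones in counts]

    def predict(x):
        index = min(int((x - low) / width), n_bins - 1)
        return bin_labels[index]

    return predict


def holdout_accuracy(predict, holdout):
    """Fraction of holdout records whose label is predicted correctly."""
    return sum(predict(x) == label for x, label in holdout) / len(holdout)


rng = random.Random(0)
# Toy data: the label is 1 when x > 5, with about 10% of labels flipped as noise
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    label = int(x > 5)
    if rng.random() < 0.1:
        label = 1 - label
    data.append((x, label))

train, holdout = data[:150], data[150:]  # identical split for every trial

results = {}
for n_bins in (1, 4, 64):  # values a person might try by hand
    model = train_binned_classifier(train, n_bins)
    results[n_bins] = holdout_accuracy(model, holdout)
    print(f"n_bins={n_bins}: holdout accuracy = {results[n_bins]:.2f}")

best_n_bins = max(results, key=results.get)
print("best n_bins:", best_n_bins)
```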

Grid Search

Grid search involves specifying a list of possible hyperparameter values you’d like to test, and then the algorithm will train models with every possible combination of the provided hyperparameter values and assess the performance of each trained model using a specified metric (e.g., the accuracy of predictions on a test data set).

Let's imagine we are developing a Decision Tree model. We are interested in tuning two hyperparameter values: the minimum number of records needed to allow for a split, and the maximum allowed depth of any node in the final tree.

The minimum-records hyperparameter sets the minimum number of training records that need to end up in a node in order for that node to be split further, and the maximum-depth hyperparameter limits how many layers of splits the decision tree can have. Both of these hyperparameters limit the number of splits that can be made in the dataset, so both can be used to mitigate overfitting. For more detail on the hyperparameters for decision trees in Alteryx, please see the Decision Tree tool mastery.

If we want to test values for each of these hyperparameters using grid search, we would provide a list of possible values for each hyperparameter (e.g., 10, 20, and 30 for the minimum records to split a node, and 2, 3, and 4 for maximum depth) and grid search would train a new decision tree model with every possible combination of values to find the best possible set of hyperparameter values.

As you can probably imagine, this can get out of control pretty quickly. In our example, we are only tuning two hyperparameters with three values each, but a total of nine different models will be trained to find the best pair of values. Although hyperparameter tuning is easy to parallelize (it is what computer science parlance calls an embarrassingly parallel problem), it is still a time-consuming process without a guarantee of finding the best possible values.

Another issue with grid search is that it only tests the incremental values you list - so if the best possible value for the minimum number of records is actually 25, grid search will never find it unless 25 is explicitly included in the list.
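The decision tree example can be sketched in a few lines. Note that train_and_score is a hypothetical stand-in, not a real decision tree: it returns a made-up score that peaks at a minimum of 25 records and a depth of 3, so that the sketch runs instantly and shows the blind spot described above.

```python
from itertools import product


def train_and_score(min_records, max_depth):
    """Hypothetical stand-in for training a decision tree and scoring it.

    Returns a made-up score that peaks at min_records=25, max_depth=3.
    """
    return 1.0 - 0.001 * (min_records - 25) ** 2 - 0.05 * (max_depth - 3) ** 2


grid = {
    "min_records": [10, 20, 30],
    "max_depth": [2, 3, 4],
}

# One trial for every combination of the listed values
trials = []
for min_records, max_depth in product(grid["min_records"], grid["max_depth"]):
    score = train_and_score(min_records, max_depth)
    trials.append(((min_records, max_depth), score))

best_params, best_score = max(trials, key=lambda trial: trial[1])

print(len(trials), "models trained")
print("best combination found:", best_params)
print("score at the untested value 25:", train_and_score(25, 3))
```

Adding a third hyperparameter with three values would multiply the number of trials to 27, which is why the size of the grid grows so quickly.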

Random Search

Random search works a little differently. Instead of giving a list of hyperparameter values for the optimization algorithm to test, you'll provide statistical distributions of hyperparameter values that you'd like the optimization algorithm to test values from. So, for our decision tree example we might provide a normal distribution for the minimum number of records with a mean of 20 and a standard deviation of 5, and a uniform distribution ranging from 1 to 6 for the maximum node depth.

The random search algorithm randomly samples hyperparameter values from the defined distributions and then tests them by generating a model. Like grid search, random search uses a pre-defined metric to determine the best set of hyperparameter values. Random search has been found to be more efficient for hyperparameter optimization - both in theory and in practice. Random search effectively searches a larger configuration space than grid search.
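Here is a minimal sketch of that sampling loop using the two distributions from the decision tree example. As before, train_and_score is a hypothetical stand-in with a made-up score rather than a real model, and the rounding and lower bound applied to the sampled minimum records are assumptions added so every sample is a usable value.

```python
import random


def train_and_score(min_records, max_depth):
    """Hypothetical stand-in for training a decision tree and scoring it.

    Returns a made-up score that peaks at min_records=25, max_depth=3.
    """
    return 1.0 - 0.001 * (min_records - 25) ** 2 - 0.05 * (max_depth - 3) ** 2


rng = random.Random(42)


def sample_hyperparameters():
    """Draw one candidate set of hyperparameter values."""
    # Normal distribution, mean 20 and standard deviation 5,
    # rounded to a whole number and kept at 2 or above
    min_records = max(2, round(rng.gauss(20, 5)))
    # Uniform distribution over the integers 1 through 6
    max_depth = rng.randint(1, 6)
    return min_records, max_depth


n_trials = 9  # same budget as a 3 x 3 grid
trials = []
for _ in range(n_trials):
    params = sample_hyperparameters()
    trials.append((params, train_and_score(*params)))

best_params, best_score = max(trials, key=lambda trial: trial[1])
print("best combination found:", best_params, "score:", round(best_score, 3))
```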

Part of the reason random search typically outperforms grid search is that only a few hyperparameters really matter for a given dataset, and finding the optimal values for these dominant hyperparameters will have more impact than getting an optimal combination of all hyperparameters. The hyperparameters that are important differ across datasets, so there is no way of knowing in advance which hyperparameters matter most for your specific dataset (i.e., there is no free lunch). Random search is more likely than grid search to find a good value for the important hyperparameters because, given the same computational budget, it tests more distinct values of each one.

Adapted from Bergstra and Bengio 2012
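The argument can be sketched numerically. The score function below is a hypothetical objective in which only one of two hyperparameters matters. With a budget of nine trials, the grid tests just three distinct values of the important hyperparameter, while random sampling tests nine. Random search is not guaranteed to win in any single run, so the sketch prints both results rather than assuming an outcome.

```python
import random
from itertools import product


def score(important, unimportant):
    """Hypothetical objective: only `important` changes the result (peak at 0.37)."""
    return 1.0 - (important - 0.37) ** 2


# Grid search: 3 values per hyperparameter, 9 trials in total
grid_values = [0.0, 0.5, 1.0]
grid_trials = list(product(grid_values, grid_values))

# Random search: the same 9-trial budget, sampled uniformly from [0, 1)
rng = random.Random(1)
random_trials = [(rng.random(), rng.random()) for _ in range(len(grid_trials))]

grid_distinct = len({important for important, _ in grid_trials})
random_distinct = len({important for important, _ in random_trials})

print("distinct values of the important hyperparameter")
print("  grid:  ", grid_distinct)
print("  random:", random_distinct)
print("best grid score:  ", round(max(score(*t) for t in grid_trials), 4))
print("best random score:", round(max(score(*t) for t in random_trials), 4))
```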

Advanced Hyperparameter Tuning Techniques

Grid and random search are among the most common hyperparameter tuning techniques used today; however, both methods leave something to be desired. Both methods iterate through a large number of possible hyperparameter values without applying any information learned in previous iterations to identifying the best hyperparameter values.

Bayesian optimization methods use Bayes' theorem to calculate a probabilistic model of how different hyperparameter values perform in terms of the pre-defined evaluation metric. This effectively allows for hyperparameter tuning where the algorithm takes into account how previous hyperparameter values have impacted the trained predictive model when determining which values to test next. Bayesian optimization methods have been found to meet or exceed the results of manual hyperparameter tuning by human experts, and produce better results in fewer iterations/trials than grid and random search methods. Fun fact: Bayesian optimization has lineage from the world of geostatistics under the name kriging.
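How the algorithm decides "which values to test next" depends on the implementation, and nothing above specifies a rule. One common choice, given here as an illustration only, is the expected improvement criterion. Assume the probabilistic model gives a Gaussian prediction at a candidate set of hyperparameter values x, with mean \mu(x) and standard deviation \sigma(x) > 0, and let f^{+} be the best metric value observed so far (higher is better). The symbols are notation introduced for this sketch.

```latex
\mathrm{EI}(x) = \mathbb{E}\big[\max\big(f(x) - f^{+},\, 0\big)\big]
             = \big(\mu(x) - f^{+}\big)\,\Phi(z) + \sigma(x)\,\varphi(z),
\qquad
z = \frac{\mu(x) - f^{+}}{\sigma(x)}
```

Here \Phi and \varphi are the standard normal cumulative distribution and density functions. The first term rewards candidates the model predicts will score well, the second rewards candidates the model is uncertain about, and the next trial is run at the candidate with the largest value.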

Evolutionary optimization follows a process inspired by the theory of evolution in biology. The algorithm starts by randomly generating many different initial sets of hyperparameter values. The algorithm then trains models with these randomly generated hyperparameter values and assesses them with a defined evaluation metric. The sets of hyperparameter values that perform the worst on the evaluation metric are thrown out, and new values are generated to replace them based on the values that performed best on the evaluation metric.

For both Bayesian and evolutionary optimization methods, the testing process is repeated until the algorithm reaches a pre-defined stopping point (e.g., max iterations = 5) or until improvement in each iteration halts or drops below a pre-defined level. These are hyperparameters for a hyperparameter optimization algorithm 🙂
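The evolutionary loop and both stopping rules can be sketched as follows. This is a simplified illustration under stated assumptions: train_and_score is a hypothetical stand-in with a made-up score, the mutation scheme (small random nudges to a surviving candidate) is one of many possible ways to generate replacements, and the population size, iteration cap, and improvement threshold are arbitrary.

```python
import random


def train_and_score(min_records, max_depth):
    """Hypothetical stand-in for training a decision tree and scoring it.

    Returns a made-up score that peaks at min_records=25, max_depth=3.
    """
    return 1.0 - 0.001 * (min_records - 25) ** 2 - 0.05 * (max_depth - 3) ** 2


rng = random.Random(7)


def random_candidate():
    """Randomly generate one initial set of hyperparameter values."""
    return (rng.randint(2, 60), rng.randint(1, 6))


def mutate(candidate):
    """Create a replacement by nudging a well-performing candidate."""
    min_records, max_depth = candidate
    min_records = max(2, min_records + rng.randint(-3, 3))
    max_depth = min(6, max(1, max_depth + rng.randint(-1, 1)))
    return (min_records, max_depth)


# Hyperparameters of the optimization algorithm itself
population_size = 10
max_iterations = 30
min_improvement = 1e-4

population = [random_candidate() for _ in range(population_size)]
history = []  # best score seen at each iteration

for iteration in range(max_iterations):
    ranked = sorted(population, key=lambda c: train_and_score(*c), reverse=True)
    survivors = ranked[: population_size // 2]  # the worst half is thrown out
    history.append(train_and_score(*survivors[0]))

    # Stop once the improvement over the previous iteration is too small
    if len(history) > 1 and history[-1] - history[-2] < min_improvement:
        break

    n_children = population_size - len(survivors)
    children = [mutate(rng.choice(survivors)) for _ in range(n_children)]
    population = survivors + children

best = survivors[0]
print("iterations run:", len(history))
print("best combination found:", best, "score:", round(history[-1], 3))
```

Because the survivors are carried into the next iteration unchanged, the best score never gets worse from one iteration to the next.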

Reading Tea Leaves to Find the Best Hyperparameter Values

Entire companies have been built around hyperparameter tuning - this is a non-trivial task. Finding the optimal hyperparameter values for your model can improve performance, but there is a cost-to-return ratio (where cost means time, effort, money, sanity) that should be considered before diving into the more complex approaches for hyperparameter tuning. For the best possible success with simpler hyperparameter tuning methods, my recommendation is to follow the maxim "Don't Be a Hero".

Ultimately, you will need to pick a starting point for your hyperparameters. In the same way that a hypothesis isn’t really a blind guess (it’s a well-informed guess), the initial values of your hyperparameters should be grounded in something. Hopefully, you did a literature review before getting involved with your machine learning project, and hopefully, in that literature review, you came across some examples relevant to your work. It is best practice to start with the hyperparameter values used in use cases similar to yours, and expand your search space based on that. The methods used for searching (manual, grid, random, or something fancy, as well as whether to use a validation dataset or cross-validation for evaluation) will be determined by the assumptions you're making for your model and the time and resources you have for the task.

Additional Resources

The scikit-learn documentation on tuning the hyper-parameters of an estimator details the available hyper-parameter tuning techniques in the Python package, as well as suggestions for using them.

Population-based training is an advanced hyperparameter optimization technique developed by researchers at DeepMind, and employs a combination of random search and genetic (evolutionary) optimization techniques. The full paper can be found here, and there is also a slide deck with nice visuals that can be found here.

If you'd like more details on Bayesian optimization for hyperparameter tuning, this article does a nice job describing the intuition.

Hyperparameter Tuning Black Magic