This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
Alteryx Data Science Design Patterns: Predictive Model Form, Part Five
In our fourth post we learned about fitness functions, which must balance fitting an induction algorithm’s retrodictions and predictions. The last thing we need to review, before riding into the wild west of data-science design patterns, is how a model uses its fitness function to choose values for the parameters of the model’s induction algorithm. That’s the job of a fitting algorithm.
A fitting algorithm is an optimization algorithm that we use to fit a model’s induction algorithm to a given dataset, using the model’s fitness function as the optimization algorithm’s objective function. (A solver is a particular implementation of an abstract optimization algorithm.) The algorithm searches the space of possible sets of parameter values for a set of parameter values that minimizes the fitness function’s value. It would surprise some practicing data scientists to learn that often a single type of induction algorithm can be (and has been) paired with several fitness functions and fitting algorithms, and that a great deal of data-science research is devoted to comparing the behavior of models that differ only in their fittness functions or fitting algorithms.
How you fit [a] model is part of the model. . . . In even moderately complex problems, . . . the details of fitting the model to data force us to recognize that our numerical technique influences our inferences. This is because different mistakes and compromises arise under different techniques. The same model fit to the same data using different techniques may produce different answers. When something goes wrong, every piece of the machine may be suspect.
(Richard McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan (CRC Press, 2016), p. 39. )
Four aspects of a fitting algorithm can influence our inferences:
Choosing a model usually means we’re choosing (perhaps unwittingly) one fitting algorithm over several others, for a given choice of induction algorithm and fitness function. By becoming aware of these implicit choices, we gain control over the tradeoffs in the five areas listed above. Let’s review these areas one by one, so you understand why they matter.
An important feature of an optimization algorithm is its search strategy, its approach to searching the space of feasible (possible) induction-algorithm parameters. The best case occurs when the fitting algorithm’s search method guarantees the algorithm always finds a globally optimal set of parameter values, that is, a set of parameter values that yields the best possible fitness-function value (the global optimum). A special case occurs when the fitting algorithm simply computes a globally optimal parameter set using a formula. Such a solution it termed a closed-form solution. Ordinary least squares linear regression is an example of a model that has a closed-form fitting algorithm.
A given fitting problem may not have a unique globally optimal solution. Several sets of parameter values may yield the same globally optimal fitness-function value. A fitting algorithm may not tell you whether multiple globally optimal solutions exist, especially if it only finds one of them. One way a fitting algorithm can tell you all of the globally optimal solutions is to consider every possible solution. Such an algorithm performs exhaustive search. This strategy guarantees finding a globally optimal solution, but is often impossible or impractical.
Some fitting problems have no known globally optimal fitting algorithm. In such cases we may still have a fitness function or other model-quality metric that tells us how good a solution is, compared to a globally optimal solution. For example, the quality of a solution to a classification problem can be expressed as the percentage of cases correctly classified by the solution, even if no algorithm classifies all cases correctly. In such cases we want a fitting algorithm to produce a near-optimal solution, a solution yielding a fitness-function value or model-quality metric value that approaches the best-possible value. A classifier that is right 95% of the time is, for most purposes, near optimal.
In still other cases we do not even have a fitness function or model-quality metric that tells us how near optimal a solution is. We then want a fitting algorithm whose search strategy gives us reasons to believe the algorithm has found, in some sense, a “good” solution. For example, stochastic search algorithms use randomization strategies that make it likely the algorithm will test feasible solutions scattered throughout the parameter space, in the hope that the diversity of the tested solutions will improve our chances of finding at least one near-optimal solution. Other search algorithms try to find one or more local optima, solutions having the best possible fitness-function value among all feasible solutions that are, in some sense, near the locally optimal solution. Stochastic search and local optimization can combine in a single fitting algorithm, so that the fitting algorithm explores many “neighborhoods” and finds each’s local optimum.
(Defining the subclasses in Figure 1 is well beyond the scope of this blog. Consult any standard reference on optimization, or start with Wikipedia’s article on optimization problems.) When we can classify a fitting problem in one of the subclasses in Figure 1, we can apply optimization algorithms having search startegies tailored to the subclass. For example, one generally applies basis-exchange or interior-point algorithms to solve linear-programming (optimization) problems. Depending on the subclass, we can also know that the algorithm will produce optimal results. The choice of algorithm then depends on other factors.
An algorithm may be stable in any of three senses. It may be
Deterministic: the algorithm always produces the same outputs for a given set of inputs.
Robust: the algorithm produces similar outputs for similar inputs.
Taken together, these properties result in an algorithm whose behavior is reliable and unsurprising.
Recall that stochastic-optimization algorithms generate and consume random numbers to drive their search processes. When the random-number generator is pseudo-random, and the generator is seeded with the same value for repeated executions of the same algorithm with the same input data, the algorithm becomes deterministic. This fact can simplify testing of a stochastic-optimization algorithm.
An algorithm’s performance can be measured and compared along several dimensions. In this blog series, when we refer to algorithm performance, we’ll mean units of work performed per time period (CPU cycle or elapsed time), in a given environment—in particular, for a given quantity of memory and processors—and for a given input size.
Scalability refers generally to how an algorithm’s performance changes with the quantity of input data or computing resources. Some algorithms perform poorly for small datasets but very well for large datasets, or conversely. Some algorithms are embarrassingly parallel, meaning they’re easy to parallelize. In particular, some algorithms can easily be data parallelized in frameworks such as MapReduce. The ease and extent to which an algorithm can be parallelized become important considerations when undertaking big-data analytics, or when the naïve implementation of the best available algorithm is inherently slow.
The issues we outline above sound, and indeed are, mathematically complex. You don’t need to understand them in great detail, to use data-science tools effectively. But you may want to explore the alternatives, if only to see how much you can learn about them, especially when you’re tackling a project with a large potential return on investment. Let’s look under the hood of Alteryx’s SVM tool, to illustrate the point.
Alteryx uses the e1071 R package’s implementation of the support vector machine model. You can discover this fact on the Model Customization tab of the SVM tool’s configuration panel; see Figure 2:
Figure 2: SVM Tool Model Customization Tab
The help link circled in red takes you to the R package documentation. The “References” section of that document’s entry for the svm() function, in turn, tells you that the package is based on the LIBSVM library, and provides links to that library’s documentation. The second reference is the paper “LIBSVM: A Library for Support Vector Machines” by the library’s authors. The R package document tells us that this paper provides “exact formulations of models, algorithms, etc.”
Unless you have a graduate degree in operations research or some similar field, a great deal of the LIBSVM paper may read like Greek. But even a moderately technical reader can learn a few key facts about LIBSVM’s fitting algorithms and the optimization problems they try to solve. For instance:
The library’s main solver solves a quadratic-programming problem (pp. 3, 9-16), a kind of nonlinear-optimization problem that in the best case (positive semi-definite Q with linear constraints) reduces to a linear-programming problem, a problem type having a globally optimal solution. However LIBSVM paper tells us the library’s solver doesn’t assume that Q is positive semi-definite (p. 10). Note too the warning in the e1071 documentation: “Parameters of SVM-models usually must be tuned to yield sensible results!” The LIBSVM beginner’s guide provides a parameter-tuning “recipe for rapidly obtaining acceptable results,” without guaranteeing the “highest accuracy.”
The LIBSVBM authors have studied their algorithm’s worst-case asymptotic performance carefully enough to warn us that “LIBSVM may take considerable training time for huge datasets,” and to provide a sub-sampling tool to deal with such cases (p. 26). So if we’re working with big data, we may need to downsample our dataset before using the SVM tool to build a scoring model.
The library’s authors have done significant practical optimization work to improve the naïve algorithm’s use of memory, so the algorithm can scale to larger datasets (pp. 17-25). So any remaining dataset-size limitations of the SVM tool may largely inhere in the SVM algorithm itself, rather than the tool’s choice of algorithm implementation.
Looking ahead to a future blog post on emsemble patterns (patterns for combining models), we can learn that LIBSVM implements the “one against one” pattern for combining binary-classification models to produce a multi-class classification model (pp. 29-30). This pattern constructs accurate classifiers at the cost of having to construct k(k – 1)/2 binary classifiers (for k target classes). So if k is large, and the dataset is large, we may not want to use Alteryx’ SVN tool for multi-class classification.
Looking ahead again to another post on cross-validation, we can learn that LIBSVM conducts five-fold cross-validation to avoid overfitting, in the process of computing estimates of probability of class membership (pp. 30-32). This might make the SVN tool’s membership-probability estimates more accurate than competing models that do not employ cross-validation in the process of computing membership probabilities.
As this example illustrates, if we’re willing to wade through a lot of mathematical jargon in a library’s technical documentation, we can come away with some useful information about when we should use the library, and how it might compare to alternative implementations.
If you’ve made it this far, you’re now equipped to start learning practical data-science design patterns, patterns you can employ to build your own analytical models. We’ll start with two sets of patterns for building models out of models: cross-validation and ensembles. From there, the sky’s the limit. Stay tuned!