dylanjsherry
Alteryx Alumni (Retired)


 

Alteryx hosts two open-source projects for modeling.

 

Featuretools is a framework to perform automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning.

Compose is a tool for automated prediction engineering. It allows you to structure prediction problems and generate labels for supervised learning.

We’ve seen Featuretools and Compose enable users to easily combine multiple tables into transformed and aggregated features for machine learning, and to define time series supervised machine learning use-cases.

The question we then asked was: what happens next? How can users of Featuretools and Compose build machine learning models in a simple and flexible way?

 

We’re excited to announce that a new open-source project has joined the Alteryx open-source ecosystem. EvalML is a library for automated machine learning (AutoML) and model understanding, written in Python.

 

import evalml
from evalml.automl import AutoMLSearch

# obtain features, a target and a problem type for that target
X, y = evalml.demos.load_breast_cancer()
problem_type = 'binary'
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(
    X, y, problem_type=problem_type, test_size=.2)

# perform a search across multiple pipelines and hyperparameters,
# using only the training split so that X_test stays unseen
# (keyword names follow recent EvalML releases; check your installed version)
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type=problem_type)
automl.search()

# the best pipeline is already refitted on the entire training data
best_pipeline = automl.best_pipeline
best_pipeline.predict(X_test)

 

EvalML's AutoML search in action

 

EvalML provides a simple, unified interface for building machine learning models and for using those models to generate insights and make accurate predictions. It provides access to multiple modeling libraries under the same API and supports a variety of machine learning problem types, including regression, binary classification and multiclass classification. Custom objective functions let users phrase their search for a model directly in terms of what they value. Above all, we’ve aimed to make EvalML stable and performant, with ML performance testing on every release.

 

What’s Cool about EvalML

 

1. Simple Unified Modeling API


EvalML reduces the amount of effort it takes to get to an accurate model, saving time and complexity.

EvalML pipelines produced by AutoML include preprocessing and feature engineering steps out of the box. Once users have identified the target column of the data they’d like to model, EvalML’s AutoML runs a search algorithm to train and score a collection of models. Users can then select one or more models from that collection and use them for insight-driven analysis or to generate predictions.

EvalML was designed to work well with Featuretools, which can integrate data from multiple tables and generate features to turbocharge ML models, and with Compose, a tool for label engineering and time series aggregation. EvalML users can easily control how EvalML treats each input feature: as a numeric feature, a categorical feature, text, date-time, etc.

 

You can use Compose and Featuretools with EvalML to build machine learning models

 

EvalML models are represented using a pipeline data structure, composed of a graph of components. Every operation applied to your data by AutoML is recorded in the pipeline. This makes it easy to turn from selecting a model to deploying a model. It's also easy to define custom components, pipelines and objectives in EvalML, whether for use in AutoML or as standalone elements.
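To make the pipeline idea concrete, here is a minimal plain-Python sketch. It is not EvalML's implementation: the class names (`MeanImputer`, `MinMaxScaler`, `Pipeline`) are hypothetical, and the components form a simple linear chain rather than a general graph. It only illustrates how recording every fitted step lets the same transformations be replayed on new data.

```python
class MeanImputer:
    """Hypothetical component: fill missing values (None) with the column mean."""

    def fit(self, X):
        self.means = []
        for col in range(len(X[0])):
            present = [row[col] for row in X if row[col] is not None]
            self.means.append(sum(present) / len(present))
        return self

    def transform(self, X):
        return [[self.means[c] if v is None else v for c, v in enumerate(row)]
                for row in X]


class MinMaxScaler:
    """Hypothetical component: rescale each column to the range seen in fit."""

    def fit(self, X):
        columns = list(zip(*X))
        self.lows = [min(c) for c in columns]
        # Guard against zero-width columns to avoid dividing by zero.
        self.spans = [(max(c) - min(c)) or 1.0 for c in columns]
        return self

    def transform(self, X):
        return [[(v - lo) / span for v, lo, span in zip(row, self.lows, self.spans)]
                for row in X]


class Pipeline:
    """Ordered list of components; each one sees the previous one's output."""

    def __init__(self, components):
        self.components = components

    def fit(self, X):
        for component in self.components:
            X = component.fit(X).transform(X)
        return self

    def transform(self, X):
        for component in self.components:
            X = component.transform(X)
        return X

    def describe(self):
        return [type(component).__name__ for component in self.components]


pipeline = Pipeline([MeanImputer(), MinMaxScaler()])
pipeline.fit([[1.0, None], [3.0, 4.0], [None, 8.0]])
new_rows = pipeline.transform([[None, None]])
steps = pipeline.describe()
```

Because the fitted pipeline carries its own preprocessing state, the object that was evaluated is the same object that can later be deployed.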

 

2. Domain-Specific Objective Functions

 

EvalML supports defining custom objective functions which you can tailor to match your data and your domain. This allows you to articulate what makes a model valuable in your domain, and to then use AutoML to find models which deliver that value.

 

The custom objectives are used to rank models on the AutoML leaderboard during and after the search process. Using a custom objective will help guide the AutoML search towards models which have the highest impact. Custom objectives will also be used by AutoML to tune the classification threshold of binary classification models.

The EvalML documentation provides examples of custom objectives and how to use them effectively.
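As a rough illustration of the idea, and not of EvalML's objective API, the sketch below scores predictions in business terms and then picks the classification threshold that maximizes that score. The dollar values, the function names `business_value` and `best_threshold`, and the sample data are all assumptions made up for this example.

```python
# Hypothetical per-prediction payoffs for a fraud-detection problem.
VALUE_TRUE_POSITIVE = 100.0    # fraud caught
COST_FALSE_POSITIVE = -10.0    # legitimate transaction flagged
COST_FALSE_NEGATIVE = -200.0   # fraud missed


def business_value(y_true, y_pred):
    """Score predictions in domain units instead of a generic metric."""
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            total += VALUE_TRUE_POSITIVE
        elif actual == 0 and predicted == 1:
            total += COST_FALSE_POSITIVE
        elif actual == 1 and predicted == 0:
            total += COST_FALSE_NEGATIVE
    return total


def best_threshold(y_true, probabilities, candidates=None):
    """Return (score, threshold) for the cutoff that maximizes the objective."""
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    scored = []
    for threshold in candidates:
        y_pred = [1 if p >= threshold else 0 for p in probabilities]
        scored.append((business_value(y_true, y_pred), threshold))
    return max(scored)


y_true = [0, 0, 1, 1, 0, 1]
probabilities = [0.1, 0.4, 0.36, 0.8, 0.6, 0.9]

default_score = business_value(y_true, [1 if p >= 0.5 else 0 for p in probabilities])
tuned_score, tuned_threshold = best_threshold(y_true, probabilities)
```

In this toy data, missing a fraud case is far more expensive than a false alarm, so the tuned threshold ends up lower than the default 0.5 cutoff.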

 

3. Model Understanding


EvalML grants access to a variety of models and tools for model understanding. Currently supported are feature importance and permutation importance, partial dependence, precision-recall, confusion matrices, ROC curves, prediction explanations, and binary classifier threshold optimization.

 

An example of partial dependence from the EvalML documentation
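For readers new to these tools, permutation importance is one of the easier ones to reason about: shuffle one feature, re-score the model, and see how much the score drops. The sketch below is a plain-Python illustration of that technique under simple assumptions (accuracy as the score, a hypothetical `toy_model` that only looks at the first feature). It is not how EvalML computes it internally.

```python
import random


def accuracy(y_true, y_pred):
    return sum(a == p for a, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column is shuffled independently."""
    rng = random.Random(seed)
    baseline = accuracy(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [value] + row[col + 1:]
                      for row, value in zip(X, shuffled)]
            permuted_score = accuracy(y, [predict(row) for row in X_perm])
            drops.append(baseline - permuted_score)
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical model: uses feature 0 and ignores feature 1."""
    return 1 if row[0] > 0.5 else 0


X = [[0.9, 3.0], [0.1, 7.0], [0.8, 1.0], [0.2, 4.0],
     [0.7, 9.0], [0.3, 2.0], [0.6, 5.0], [0.4, 8.0]]
y = [1, 0, 1, 0, 1, 0, 1, 0]

importances = permutation_importance(toy_model, X, y)
```

Shuffling the ignored second feature cannot change the predictions, so its importance is exactly zero, while shuffling the first feature can only hurt accuracy.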

 

4. Data Checks


EvalML's data checks can catch common problems with your data prior to modeling, before they cause model quality problems or mysterious bugs and stack traces. Current data checks include a simple approach to detecting target leakage (where the model is given access to information during training which won’t be available at prediction time), as well as detection of invalid datatypes, high class imbalance, highly null columns, constant columns, and columns which are likely an ID and not useful for modeling.
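To give a feel for what such checks involve, here is a small plain-Python sketch covering three of them: highly null columns, constant columns, and a correlation-based target leakage heuristic. The function name `run_data_checks`, the thresholds, and the sample columns are assumptions for illustration. EvalML's own checks are more thorough and use a different interface.

```python
import math


def pearson(a, b):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def run_data_checks(columns, target, null_threshold=0.95, leakage_threshold=0.95):
    """Return human-readable warnings for numeric columns (None means missing)."""
    warnings = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        null_fraction = 1 - len(present) / len(values)
        if null_fraction >= null_threshold:
            warnings.append(f"{name}: highly null ({null_fraction:.0%} missing)")
            continue
        if len(set(present)) == 1:
            warnings.append(f"{name}: constant column")
            continue
        # Leakage heuristic: a feature that is almost perfectly correlated
        # with the target deserves a second look before modeling.
        pairs = [(v, t) for v, t in zip(values, target) if v is not None]
        r = pearson([v for v, _ in pairs], [t for _, t in pairs])
        if abs(r) >= leakage_threshold:
            warnings.append(f"{name}: possible target leakage (r={r:.2f})")
    return warnings


columns = {
    "age": [25, 32, 47, 51, 38, 29],
    "empty": [None] * 6,
    "country_code": [1] * 6,
    "paid_out": [0, 0, 1, 1, 0, 1],  # hypothetical column recorded after the outcome
}
target = [0, 0, 1, 1, 0, 1]

warnings = run_data_checks(columns, target)
```

Running checks like these before the AutoML search surfaces suspicious columns early, when they are still cheap to investigate.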

 


 

 

Getting Started Using EvalML

 

You can get started with EvalML by visiting our documentation page, which has installation instructions, tutorials with examples of how to use EvalML, a user guide describing the components and core concepts of EvalML, an API reference and more. The EvalML codebase lives at https://github.com/alteryx/evalml. To get in touch with the team, check out our open-source Slack. We are actively contributing to the repository and will respond to any issues you post.

 

What’s Next?

 

EvalML has an active feature roadmap, including time series modeling, parallel evaluation of pipelines during AutoML, upgrades to the AutoML algorithm, new model types and preprocessing steps, tools for model debugging and model deployment, support for anomaly detection, and much more.

Want to hear more? If you’re interested in updates as the project continues, please take a moment to follow this blog, star our repo on GitHub, and stay tuned for more features and content on the way!