Data Science

SydneyF · ‎06-12-2019

The 2019.3 Beta includes a new predictive analytics application, Assisted Modeling, as well as a whole new palette of predictive tools to play with: the Python-based Machine Learning tools.

To get these tools you will need to:

Join the Beta Program
Download and install the 2019.3 Beta
Download and install the Machine Learning Tools .yxi file.

Can you feel the anticipation?

With this installation, you will have access to five new tools: the Start Pipeline tool, the Transformation tool, the Classification tool, the Fit tool, and the Pyscore tool.

The way the Machine Learning tools work is a little different from the R-based Predictive tools you might already be familiar with. Instead of each tool containing code to train a separate model, the Machine Learning tools are designed to be used together to create a modeling “pipeline”.

In this context, a modeling pipeline refers to the steps needed to get your data from a raw input format to a fully trained and usable model, including any encoding or transformation performed on the data to get it into a usable format. You can think about the Machine Learning tools like little Lego blocks to build out a model. Each tool can be used to perform an individual process, from identifying your target variable and marking the start of your pipeline (Start Pipeline tool) to combining all of your tools into a list of instructions and fitting the transformed data to a model (Fit tool). By stacking all of your Lego blocks together, you end up with a fully fleshed-out tower with walls and a roof, not just the fancy spiral stairs. When your trained model object is created and output, it includes each of the steps in your built-out pipeline.

When you are building out a Machine Learning pipeline, the experience is similar to working with other Alteryx tools. Each tool passes metadata to the next tool, allowing you to configure downstream tools while accounting for any changes upstream tools might make to the data. However, when you run the workflow, the Machine Learning tools do not execute their processes one by one in sequence on the canvas. Instead, the instructions each tool contains are passed down through the pipeline and executed in order as a single "to do list" by the Fit tool at the end of the pipeline.
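To make the "to do list" idea concrete, here is a rough sketch in plain Python. It is not how the Machine Learning tools are implemented; `PipelineSketch`, `select_columns`, and `cast_to_float` are hypothetical names invented for illustration. The point is only that adding a step records an instruction, and nothing runs until `fit` is called at the end.

```python
from typing import Callable, Dict, List, Tuple

Rows = List[Dict[str, object]]
Step = Callable[[Rows], Rows]


class PipelineSketch:
    """Hypothetical stand-in for a pipeline; not the Alteryx implementation."""

    def __init__(self) -> None:
        self.steps: List[Tuple[str, Step]] = []
        self.executed: List[str] = []

    def add(self, name: str, step: Step) -> "PipelineSketch":
        # Configuring a "tool" only records its instruction; nothing runs yet.
        self.steps.append((name, step))
        return self

    def fit(self, rows: Rows) -> Rows:
        # The final step works through the recorded to-do list in order.
        for name, step in self.steps:
            rows = step(rows)
            self.executed.append(name)
        return rows


def select_columns(rows: Rows) -> Rows:
    # Keep only the columns the model should see.
    keep = ("sepal_length", "species")
    return [{key: row[key] for key in keep} for row in rows]


def cast_to_float(rows: Rows) -> Rows:
    # Convert the text measurement to a number.
    return [{**row, "sepal_length": float(row["sepal_length"])} for row in rows]


pipeline = (
    PipelineSketch()
    .add("select_columns", select_columns)
    .add("cast_to_float", cast_to_float)
)
ran_before_fit = list(pipeline.executed)  # still empty: steps are only queued

raw = [{"id": 1, "sepal_length": "5.1", "species": "setosa"}]
result = pipeline.fit(raw)  # both steps run here, in the order they were added
```

Because each step does one job, selecting columns and changing data types are two separate entries in the list, much like two separate tools on the canvas.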

With this overview in mind, let’s look at the pipeline architecture by examining each of the tools individually.

The Start Pipeline tool

In the beta release, any Machine Learning Pipeline needs to start with the Start Pipeline tool (was that sentence as fun to read as it was to write?). This is the tool you feed your input data to, and where the Python-based machine learning process starts.

The configuration of the Start Pipeline tool is simple: all you need to do is specify your target variable. It is important to note here that in this beta release, the Machine Learning tools only support Classification models, so your target variable will need to be a categorical one.

The Transformation tool

This tool is used to modify your data into machine learning-usable formats. It includes the ability to select columns to include in your model, perform data typing, impute missing values, and perform one-hot encoding for categorical variables. This tool includes a lot of important functionality for preprocessing your data, so expect to use it more than once in your pipeline.

In the Configuration window, you will need to Select a transformer. This will define the way you are using the tool.

Selecting Column selection as your transformer causes the transformation tool to behave like an Alteryx Select tool, allowing you to remove any variables that should not be included as predictor variables in your model.

Selecting Data Typing as your Transformer also causes the Transformation tool to behave like a Select tool (a different part of the Select tool), where you can adjust the data types of each of your variables.

The Missing value imputation transformer is where you can handle nulls in your dataset. This step is particularly cool with the pipeline architecture because it is a best practice in data science to fill in missing values in any future data set using values derived from your training data set. Because the imputation becomes a part of the pipeline, the values calculated from your training data set to replace nulls are saved with your model, and any nulls in data fed into the model in the future will be replaced with those same training-derived values.

Even though the Iris dataset does not have null values, I am going to include this transformation step so that my resulting model is primed to handle any nulls fed into the model later.
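As a rough illustration of imputation living inside the model, here is a sketch written directly against scikit-learn. It is an analogy rather than the code behind the Transformation tool, and the choice of mean imputation, the logistic regression model, and the step names `impute` and `classify` are all my assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X_train, y_train = load_iris(return_X_y=True)

# Imputation is a step inside the model, so its fill values are learned at fit time.
model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # assumption: mean imputation
    ("classify", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# Column means computed from the training data and stored with the fitted model.
training_means = model.named_steps["impute"].statistics_

# A later record with a missing second measurement.
new_record = np.array([[5.9, np.nan, 4.2, 1.3]])
filled = model.named_steps["impute"].transform(new_record)
prediction = model.predict(new_record)  # the null is filled before classifying
```

The missing value in `new_record` is replaced with the training set's column mean, not with anything computed from the new data.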

The last transformer, One Hot Encoding, allows you to encode any categorical variables in your dataset, putting them into a format that the models in the Machine Learning tools (based on the Python package scikit-learn) can handle correctly.

My dataset does not have any categorical variables other than the target variable, so I’ll skip this transformer in my pipeline.
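For readers who have not seen one-hot encoding before, here is a small sketch using scikit-learn's `OneHotEncoder`. The color column is made-up example data, and `handle_unknown="ignore"` is my own choice for the illustration; I am not claiming the Transformation tool is configured this way.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with three distinct values.
train_colors = [["red"], ["green"], ["blue"], ["green"]]

# Assumption: categories unseen at training time are encoded as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_colors)

# Categories are learned from the training data, in sorted order.
categories = list(encoder.categories_[0])

# Each value becomes one indicator column per learned category.
encoded = encoder.transform([["green"], ["purple"]]).toarray()
```

Each text value turns into a row of 0s and 1s with one column per category, which is the numeric form scikit-learn models expect.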

An important thing to note about the Transformation tool is that it can only serve one purpose at a time. This means that if you need to select columns in your dataset and change data types for your incoming dataset, you will need to have two separate Transformation tools in your pipeline (this is why you should expect to have more than one Transformation tool in a pipeline).

This beta release is just the beginning for the Transformation tool. The developers are hard at work adding a variety of other preprocessing steps for use in a pipeline. Stay tuned. 🙂

The Classification tool

This is where the classification model choices live. In the beta release, the Classification tool includes logistic regression, random forest, and decision tree (with plans to add more).

Depending on which algorithm (model recipe) you select, different General and Advanced parameters will be populated in the Configuration window.

Here's the Random Forest Configuration Window

There should only be one Classification tool in a single pipeline stream, but you could split your pipeline into multiple streams to train multiple models.

The algorithms available in the Machine Learning tools will increase over time, with plans to add more classification algorithms and regression algorithms.
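The idea of splitting into multiple streams to train multiple models can be sketched in scikit-learn as one pipeline per algorithm, each ending in exactly one classifier. This is an analogy only: the dictionary keys, step names, and parameter values below are assumptions of mine, not settings taken from the Classification tool.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One candidate per algorithm listed for the beta release.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Each "stream" repeats the preprocessing and ends in a single classifier.
fitted = {}
for name, classifier in candidates.items():
    stream = Pipeline([("impute", SimpleImputer()), ("classify", classifier)])
    fitted[name] = stream.fit(X, y)

# Accuracy on the training data, only to show that every stream produced a model.
training_accuracy = {name: stream.score(X, y) for name, stream in fitted.items()}
```

Training accuracy is not a fair way to compare models; a held-out data set would be needed for that.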

The Fit tool

The Fit tool is the final step in the pipeline process, and where you close off your pipeline. The output of a Fit tool is your model, including any transformation steps you made along the way. Nothing is required in the Configuration window; just add it to the end of your pipeline!

Pyscore

If you’d like to use your model on an unseen dataset, you will need to use the Pyscore tool! Here is where the pipeline concept might click for you if it hasn’t yet – when you feed data into the Pyscore tool, you want to feed in data in the same format as the data you fed into the Start Pipeline tool. All of the preprocessing we performed with the Transformation tools is included in the model object output by the Fit tool.

The Pyscore tool doesn’t require any configuration. Just hook up your data input to the “D” anchor and the model to the “M” input anchor, then click Run!
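Here is a rough scikit-learn sketch of why scoring needs only raw data plus the model object. It stands in for the Fit and Pyscore steps but is not their implementation; using `pickle` to represent "the model object" and the step names are assumptions for the example.

```python
import pickle

import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

# The fitted object holds the preprocessing and the classifier together.
model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("classify", LogisticRegression(max_iter=1000)),
]).fit(X, y)

# Stand-in for passing the model object along (assumption: pickle serialization).
model_bytes = pickle.dumps(model)
restored = pickle.loads(model_bytes)

# New records in the same raw format as the training input, nulls included.
raw_records = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [6.7, np.nan, 5.2, 2.3],
])
predictions = restored.predict(raw_records)
```

No separate preprocessing code has to travel with the model: the restored object imputes the missing value and classifies in one call.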

Sweet sweet predicted records

Hopefully, this example has made the architecture of the new Machine Learning tools clear! Pipelines are a really neat approach to predictive modeling in Alteryx that allow models to be neatly packaged with all of the preprocessing steps included. This means that when you share or deploy your model, data can be fed into it in the same format that you fed data into your Start Pipeline tool. No need to worry about sharing your preprocessing steps separately from a model object.

If you’re ready to try the shiny new Machine Learning tools out for yourself, please enroll in our beta program at beta.alteryx.com.

CN_BOI · ‎06-12-2019

Very impressive feature. This is fantastic for Alteryx considering the continuous rise of Python in the data science community. I particularly like the one-hot encoding. If Alteryx can also include automated feature engineering using 'featuretools' python library that'll definitely be a game changer!

Well-done guys.

Love it

Charles

natej · ‎06-17-2019

Awesome.

I hope this eventually includes models like XGBoost, LightGBM, etc. as well as some sort of grid/random search for parameter tuning.

diksha1107 · ‎06-17-2019

Hey,

This is really an awesome development. But I am not able to find some of these tools in my Alteryx. Can you please tell me how I should import other tools?

In fact, there are various other tools that I am not able to import.

Thanks in advance!

SydneyF · ‎06-18-2019

Hi @diksha1107,

These tools are a part of the 2019.3 Beta release. In order to use them, you will need to:

Join the Beta Program.
Download and install the 2019.3 Beta.
Download and install the Machine Learning Tools .yxi file. This file will also be available through the Beta Portal. It is a separate file from the Designer install. Running the .yxi file on your machine should automatically install the tools to use in Designer.

Hope this helps!

asilva · ‎06-20-2019

@SydneyF Will you be able to set your seed for randomness when running these types of assisted modeling? That way results can be reproducible when building models?

SydneyF · ‎06-24-2019

Hi @asilva,

There is a Random seed option in the Advanced Parameters menu in the Classification tool for each of the current models, making the results of the model building process reproducible.
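As a small illustration of what a fixed seed buys you, here is the equivalent idea in scikit-learn, where the setting is called `random_state`. This is a sketch of the concept, not the Classification tool's code, and the seed value and `train_forest` helper are made up for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)


def train_forest(seed: int) -> RandomForestClassifier:
    # The seed controls the random sampling of rows and features in each tree.
    return RandomForestClassifier(n_estimators=25, random_state=seed).fit(X, y)


first = train_forest(seed=42)
second = train_forest(seed=42)

# Two runs with the same seed on the same data give the same model.
same_importances = np.allclose(
    first.feature_importances_, second.feature_importances_
)
same_predictions = np.array_equal(first.predict(X), second.predict(X))
```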

So, to answer your question, yes :)

Thanks!

Sydney

Benson · ‎06-24-2019

Hello Sydney,

Does one need to pay to join the beta program?

Thanks in advance,

Benson

SydneyF · ‎06-24-2019

Hi Benson,

Nope, the Beta Program is free to join! You just need to apply here. If you have any further questions, please reach out to betas@alteryx.com.

Thanks,

Sydney

rag-ryx · ‎07-01-2019

Thank you @SydneyF for such detailed blogs on machine learning tools/assisted modeling. Loved reading them.

Best,

Raghav

govindarajand · ‎07-07-2019

Could you please tell how this is different from the already present tools in the stable version?

Like we have Linear regression, forest model and such in the Predictive capabilities of Alteryx.

SydneyF · ‎07-08-2019

Hi @govindarajand,

There are two differences that immediately come to my mind when thinking about the current predictive tools and the new machine learning tools.

The first is that these new machine learning tools are based in Python, where the predictive tools are based in the R programming language. This means that the tools will have slightly different implementations for each algorithm (e.g., logistic regression in R vs. logistic regression in Python). There is a nice comparison of the two languages from Dataquest you can read here.

The other major difference that comes to mind is the pipeline architecture of the machine learning tools. With the R-based predictive tools, each individual tool is essentially an individual recipe to train a model. With the Python-based machine learning tools, you can add extra pre-processing steps to your model "recipe" and have it included in the final model object by leveraging a pipeline, as described in this blog post. This is useful because all pre-processing steps are contained in the model object, making it easier to deploy or share the final model.

The new machine learning tools also support the Assisted Modeling application, which is currently in Beta as well.

There are other differences between the two suites of tools, but I think these are the most important at a high-level. I hope this helps!

govindarajand · ‎07-09-2019

Hi @SydneyF,

Thanks for the explanation of the differences!

AJacobson · ‎10-04-2019

@CN_BOI We love your idea on adding FeatureTools capabilities to the product. Great idea. Check out this:

https://investor.alteryx.com/news-and-events/press-releases/press-release-details/2019/Alteryx-Acqui...

and

https://www.alteryx.com/press-releases/2019-10-04-alteryx-acquires-feature-labs-to-advance-machine-l...

CN_BOI · ‎10-07-2019

This is amazing news. Great to see that Alteryx picked up my suggestion on featuretools and have now bought this company - FeatureLabs. Great validation! Again, this is particularly interesting considering that Python is now the most popular data science language and feature engineering has the most influential outcome on any ML model. So, I am truly excited. Well-done @AJacobson and I'm sure @SydneyF played a major role in this decision as well. Kudos to the Alteryx team!

My only question is - when can I start playing with this new toy 🙂 Do we have a timeline for its integration into Alteryx?

Cheers,

Charles

MDang · ‎11-03-2019

Wow, this is great @SydneyF. Thank you for the article. With the latest Alteryx version, I also have another new tool called "Predict" within the Machine Learning category. When I tried to run it, I got the error "traceback: main.py line 8 not found". Does anyone have more information to share on this? How does it differ from the existing Pyscore tool? How can I get this to work? Thank you in advance, mdang

DavidCo · ‎11-04-2019

@MDang it sounds like you might have some tools available that are leftover from the 2019.2 or 2019.3 beta. In this case, the Predict tool is a replacement for the PyScore tool. The error you're seeing leads me to believe there is some incompatibility between the tool versions you're using since they're still in beta. The tools can be deleted by navigating to %APPDATA%\Alteryx\Tools and deleting the appropriate directories.

MDang · ‎11-04-2019

@DavidCo thank you for this. Is there more information on how I can get the "Predict" tool to work properly? I have gone to the appropriate directory and deleted the "Pyscore" folder/tool. However, the Predict tool is still erroring. Are Alteryx Machine Learning and the Predict tool currently still in beta? Have they not been officially rolled out to regular production versions? I already have the latest BETA 2019.4 (version 2019.3.0.17508) installed in my User directory (i.e. C\User\...). I re-installed my Alteryx Designer (production version 2019.3.5.17947) in my admin directory (i.e. C:\program files\) and I reinstalled the Machine Learning Tools .yxi file (ML_Tool.yxi), which I downloaded a few months ago. I am still getting the same error. Sorry to make you my tech support, but any help will be very much appreciated. Thank you in advance. MDang

MDang · ‎11-04-2019

@DavidCo please ignore my question above. I figured it out. I deleted the Pyscore tool and re-installed the BETA 2019.4, and it is working now. Thanks again for showing me how to delete the Pyscore tool. mDang

What's a Pipeline? An Overview of the New Python-based Machine Learning Tools