
Data Science Blog

Machine learning & data science for beginners and experts alike.
Sr. Data Science Content Engineer

 

Last December, Microsoft announced the general availability of its Azure Machine Learning service. An exciting component of this service is the automated machine learning functionality, which intelligently generates multiple algorithms and hyperparameter combinations for any provided input data set, then selects and returns the best-performing model using a matrix factorization technique rather than more traditional brute-force search methods.

 

Today, we are very excited to announce the public preview of two Alteryx Designer tools that connect to Azure Machine Learning automated ML, allowing you to easily leverage the power of automated machine learning from inside Designer.

 

With the Azure Machine Learning Training tool, you can send data directly from Designer to be trained by Azure Machine Learning automated ML. With the Azure Machine Learning Scoring tool, you can then apply the trained model to make predictions on new data.

 





Both of these tools are available to download now through the Alteryx Gallery, as part of the Connectors tool category. You do not need to have the R-based Predictive tools installed to use them.

 

To demonstrate how these tools can be leveraged for your own use cases, I thought it would be fun to experiment with a Microsoft-published data set on identifying machines that are likely to be hit with malware. This data set was published as a part of a Kaggle competition hosted earlier this year. The bad news is that the competition ended about a month ago; the good news is that this means we have benchmarks to compare our results against.

 

Data Preparation

 

The most important step of any good data science project is data preparation and data exploration. Having a good understanding of the data set you are working with enables you to make sure the data you provide to the model is clean and accurate, as well as to create more effective features (hooray for feature engineering!). A model can only ever be as good as the data you feed into it, so it is important to spend some time here.

 

 

 

Microsoft's Data Science Life Cycle - https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview



 

The great news is that Azure ML performs pre-processing and featurization automatically, including missing value imputation and one-hot encoding (with hashing for high-cardinality features). To get the lay of the land, I am still going to take a look at our data in Alteryx and remove some features manually.
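To make that automatic featurization concrete, here is a rough pandas sketch of the same two steps (median imputation and one-hot encoding) on a toy frame; the column names are made up for illustration, not taken from the actual data set:

```python
import pandas as pd

# Toy stand-in for two columns of the malware data (names are illustrative)
df = pd.DataFrame({
    "ProcessorCores": [2, 4, None, 8],          # numeric, one missing value
    "OsPlatform": ["x64", "x86", "x64", None],  # categorical, one missing value
})

# Impute the missing numeric value with the column median
df["ProcessorCores"] = df["ProcessorCores"].fillna(df["ProcessorCores"].median())

# One-hot encode the categorical column (nulls simply get all-zero dummies)
df = pd.get_dummies(df, columns=["OsPlatform"])
```

Azure ML does all of this (and more) without being asked, but seeing the operations spelled out makes it easier to decide which steps you might want to control yourself.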

 

Reading in the raw data downloaded from Kaggle, the first thing we need to note is that this data set is on the large end (8,921,483 records and 84 columns – a total of 6.9 GB of data).

For this baseline run, I am just going to take a quick look at the data and remove any features that are unlikely to be useful for predicting the target variable. This will help reduce the overall size of the data being passed to the Azure Machine Learning service.

 

One of the first things I like to do when working with a new data set in Alteryx destined for machine learning is to plug it into a Field Summary tool.

 

 




In the O output of the Field Summary tool, we can see that there are a few fields with more than 95% missing values, as well as fields where more than 95% of the observations have the same value. Because these fields are so skewed, they aren’t going to be particularly meaningful. Using a Select tool, I am going to remove any fields that have more than 95% nulls (highlighted in AYX Blue Razz (light blue)), fields where more than 95% of records have the same value (highlighted in AYX Hot Sauce (dark pink)), and fields that have a constant value (highlighted in AYX Cotton Candy (light pink)).
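If you wanted to replicate this Select-tool cleanup in code, a small pandas helper along these lines would do it (a sketch, using the same 95% thresholds described above):

```python
import pandas as pd

def drop_low_information(df, threshold=0.95):
    """Drop fields that are mostly null, dominated by one value, or constant."""
    keep = []
    for col in df.columns:
        null_frac = df[col].isna().mean()
        if df[col].notna().any():
            # Share of ALL records taken up by the most common value
            top_frac = df[col].value_counts().max() / len(df)
        else:
            top_frac = 0.0  # an all-null column is caught by null_frac
        if null_frac <= threshold and top_frac <= threshold:
            keep.append(col)
    return df[keep]
```

A constant field is just the limiting case of "more than 95% of records share one value," so the same check covers both of those highlighted groups.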

 

We can also see that there is a nice, nearly 50/50 split of values in the target variable [HasDetections]. Excellent.

 

 




 

Removing these fields has reduced the data set from 84 to 63 fields, and from 6.9 GB of data to 5.2 GB. This is still pretty large, and working with a dataset this size to establish a baseline is going to be slow and somewhat inefficient. To deal with this, I am going to randomly sample the data down to 1 in 20 observations for each target variable value with a Sample tool. I’m also going to add on a Create Samples tool to divide the remaining records into train and test data sets. I’ll hold on to the test data set to run it through the Azure Machine Learning Scoring tool a little later.
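In code, the stratified down-sample and the train/test split might look something like this pandas sketch (the toy frame and the 70/30 split are my own assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for the cleaned data: one feature plus the target variable
df = pd.DataFrame({
    "Feature": range(2000),
    "HasDetections": [0, 1] * 1000,
})

# Keep roughly 1 in 20 records for each target value (stratified down-sample)
sampled = df.groupby("HasDetections").sample(frac=0.05, random_state=1)

# Hold out a test set to score later; the rest becomes training data
test = sampled.sample(frac=0.3, random_state=1)
train = sampled.drop(test.index)
```

Sampling within each target group keeps the near 50/50 class balance we saw in the Field Summary output intact after down-sampling.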

 

Removing these features, as well as down-sampling, still leaves nulls and fields with high cardinality, which can be problematic for machine learning models. For this first pass, I am going to leave this preprocessing to the Azure Machine Learning service, which will impute nulls and drop fields with high cardinality. I can always come back and perform additional cleansing and processing in Alteryx to have a little more control, including some custom feature engineering - model building and refinement is an iterative process after all 😊.

 

Now I can connect the Azure Machine Learning Training tool to my data stream. Here is what the workflow looks like at this point:

 

Not so scary, right?

 

The Azure Machine Learning Training Tool

 

With the first pass at preprocessing set up as an Alteryx workflow, I can connect an Azure ML Training tool to my data stream and configure it in the Configuration window. To keep things streamlined, I am going to breeze through the configuration of the tool, but if you’d like to know more, please check out the help documentation.

 

 

 





In the Configuration window for the Azure Machine Learning Training tool, the first thing we need to do is name our experiment. The experiment name can be anything – I think for this project, “Malware” has a nice ring to it.

 

In the Target variable option, I’ll select the variable I’m interested in estimating from a drop-down menu. In this example, it is [HasDetections].

 

Next, I need to specify what type of model I am trying to build – either a classification experiment or a regression experiment. This option determines which set of algorithms Azure automated ML will try, and it should depend on the target variable. If the target variable is made up of continuous values, then it’s a regression experiment. In my case, the target variable is made up of categorical values ([HasDetections]: Yes (1) or No (0)), so I’ll select classification.
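That rule of thumb can be captured in a small helper. This is just a heuristic I am sketching (the 20-distinct-values cutoff is an arbitrary assumption of mine), not anything the tool itself exposes:

```python
import pandas as pd

def infer_task(target: pd.Series, max_classes: int = 20) -> str:
    """Continuous targets suggest regression; categorical ones, classification.

    The max_classes cutoff is an arbitrary heuristic: a numeric column with
    only a handful of distinct values is treated as categorical.
    """
    if pd.api.types.is_numeric_dtype(target) and target.nunique() > max_classes:
        return "regression"
    return "classification"
```

A 0/1 target like [HasDetections] is numeric on paper but has only two distinct values, which is exactly why it lands on the classification side.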

 

The next configuration option, Compute configuration, expands into a larger menu. The Azure Machine Learning Training tool sends data from Alteryx on our local machines to a remote compute resource. This is where we can define the compute resource that is used.

 

 




 

This can all be left as default values, which is what I am going to do! A new compute resource will be created if the one specified does not already exist. Note that if you set the Min Nodes option to a value greater than zero, the Azure Machine Learning service will create a cluster with that minimum number of nodes and keep them running even after the compute resource is no longer needed.
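For comparison, here is roughly what the same compute setup looks like if you provision it yourself with the Azure ML Python SDK. This is a hedged sketch: it assumes the v1 `azureml-core` SDK, and the cluster name and VM size are placeholder values of mine, not anything from the tool's defaults:

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()  # reads subscription, resource group, workspace

# min_nodes=0 lets the cluster scale down to nothing when idle;
# a min_nodes above zero keeps nodes running after the run ends.
config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_D2_V2",  # placeholder VM size
    min_nodes=0,
    max_nodes=4,
)
target = ComputeTarget.create(ws, "cpu-cluster", config)
target.wait_for_completion(show_output=True)
```

This is the same trade-off the Min Nodes option controls in the tool: scale-to-zero saves money between runs at the cost of cluster spin-up time on the next one.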

 

Advanced Options is another expanding menu. I’m going to leave these values as default as well.

 

 




 

Finally, I need to provide my Azure account information, so the tool knows what to connect to; this includes a Subscription ID, Resource Group, and Workspace Name.

Now we can hit the run button and see what happens!

 

Checking the Experiment in Azure Portal

 

The Azure Machine Learning Training tool will give you incremental updates as it pushes data up to the service, prepares the data, and starts running models.

 

As soon as you see the text Experiment Created: in your results window, you will be able to access the experiment in the Azure portal, in your specified workspace under Experiments.

 

 

 




 

Opening up the experiment by clicking on the name will show you a dashboard of your runs within the experiment.

 

 

 




 

Clicking on the Run Number will open a dashboard detailing the iterations of algorithms that were included in a given run.

 

 

 


 



In this dashboard, you can see the accuracy of different iterations based on cross-validation. Clicking on the iteration number in the bottom-right table will open even more information, including a plethora of metrics on model performance for that specific model.

 

 

 




 

We are rich with model metrics to assess the models trained by the Azure Machine Learning service! With these metrics, you can keep iterating in an experiment until a model is trained to your satisfaction.

 

At this point, I’m ready to start feeding the model some new observations to see how it performs.

 

The Azure Machine Learning Scoring tool

 

With the model trained and living in my Azure workspace, I can send new, unseen test records to the experiment, and the service will return estimated values for the target variable [HasDetections] using the best model in the experiment. The best model in an experiment is defined by the primary metric: accuracy for classification and R2 for regression.

 

It’s important that the data I feed into the Azure Machine Learning Scoring tool has the same schema as the data I used to train the Malware experiment – enter the second Select tool in the workflow, which removes the target variable [HasDetections].

 

The configuration of the Azure Machine Learning Scoring tool is really straightforward – just give it the directions to find your experiment, and it is good to go. The Azure Machine Learning Scoring tool actually downloads the best model from the Azure Machine Learning service and runs the scoring locally on your machine, so no need to worry about remote computational resources.

 

 




I’ll add a few tools (a Join tool, a couple of Formula tools, and a Summarize tool) to calculate the accuracy of the model on the test data set.

 

 

 




Now we let it run!

 

With a quick calculation, we see that the model estimated 62% of records correctly, and 38% of records incorrectly.
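The accuracy calculation itself is trivial once the actual and predicted labels sit side by side; in pandas it boils down to something like this (toy labels standing in for the real scored output):

```python
import pandas as pd

# Toy actual vs. predicted labels standing in for the joined, scored test set
scored = pd.DataFrame({
    "HasDetections": [1, 0, 1, 1, 0],
    "Predicted":     [1, 0, 0, 1, 1],
})

# Share of records where the model's estimate matches the actual label
accuracy = (scored["HasDetections"] == scored["Predicted"]).mean()  # 3 of 5 correct
```

The Join, Formula, and Summarize tools in the workflow are doing exactly this: line up actuals with predictions, flag the matches, and average the flags.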

 

 




 

Considering the top leader in the Kaggle competition got 67% of the test data set correct after months of meticulous feature engineering and hyperparameter voodoo, I think this is a great start. At this stage, I can explore these results further to see where the model is missing things, go back and perform more targeted data preparation and feature engineering to see if it improves performance, or just go for it: run the model on the Kaggle test data set, submit the results, and see how it performs against other people’s hard work and tears.

 

Machine learning is an iterative process and still does require some expertise and understanding, but these tools enable users to rapidly test many different algorithms and tune hyperparameters without much intervention – something that can be very efficient and powerful for both beginners and experts alike. The Azure Machine Learning tools are a fantastic way to do rapid prototyping, then develop full, robust models ready for deployment.

 

We are very excited to have this included in the Alteryx toolbox – and even more excited to hear how our users put it to work.  When you have something neat and exciting to share, please don’t hesitate to comment on this blog or to reach out to us at community@alteryx.com to write a blog of your own! Please note that these tools are currently in Beta and not supported for production use - check back on the tools in the fall for an update!

Sydney Firmin

A geographer by training and a data geek at heart, Sydney joined the Alteryx team as a Customer Support Engineer in 2017. She strongly believes that data and knowledge are most valuable when they can be clearly communicated and understood. In her current role as a Sr. Data Science Content Engineer, she gets to spend her days doing what she loves best; transforming technical knowledge and research into engaging, creative, and fun content for the Alteryx Community.


Comments

This is really great to see how optimization of models is enhanced by Azure. 

 

Thank you!

 

Seema