Data Science

Machine learning & data science for beginners and experts alike.
Alteryx

Featured on the Data Science Portal.

 

Assisted Modeling Blog Series: Tackling Competitions

 

In this series, we will look at different machine learning and data science competitions, see how the new Assisted Modeling tools for Alteryx perform in them, and show how easy these tools make it to build fast, effective predictive models.

 

We'll start with one of the most basic and well-known Kaggle competitions, often used in data science and machine learning courses: Titanic: Machine Learning from Disaster. This competition challenges competitors to predict whether a given person survived the historic sinking of the Titanic based on metadata about that individual. We will walk through the most straightforward approach to solving this problem using the new Assisted Modeling tools and compare the results of our trained models against those of other competitors.

 

 

Pre-Requisites

To follow along and get the most out of this post, you should have Alteryx Designer installed, along with the properly licensed Alteryx Intelligence Suite tools. You can download and install Alteryx Designer from the Download Page.

 

 

Download the Dataset

First, we need the Titanic dataset to train our machine learning model. You can download the dataset from the Kaggle website. The data is split into three files:

 

  • train.csv - The training data for our machine learning model.
  • test.csv - The data we will generate predictions for; Kaggle scores those predictions. Our model has not seen this data before.
  • gender_submission.csv - This is an example of how the data needs to be formatted for submission.

 

After downloading the data, open Alteryx Designer and import these datasets into a new workflow using the Input Data tool. Use a Browse tool to inspect the contents and get a feel for what you have to work with in this challenge. Once you understand the contents of the CSV data, you're ready to move on.
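
If you want a quick second opinion outside Designer, the same kind of inspection can be sketched in Python with pandas. This is only an illustration and not part of the workflow: the inline rows are made-up stand-ins for train.csv, and summarize_columns is a hypothetical helper name. What it shows is the thing worth checking in a Browse tool anyway: each column's type and how many nulls it holds.

```python
import io

import pandas as pd

# Made-up stand-in rows for train.csv (assumption); with the real file,
# use pd.read_csv("train.csv") instead.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.2833
3,1,3,female,,7.925
4,0,3,male,35,
"""


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and its count of nulls."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "null_count": df.isna().sum(),
        }
    )


df = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = summarize_columns(df)
print(summary)
```

Running the same summary on both train.csv and test.csv is a cheap way to spot columns that have nulls in one file but not the other.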

 

Alternatively, I've provided a .yxzp file that you can use as a starting point. It is structured like the Weekly Challenges, so it should be intuitive to follow, though it is fairly bare.

 

Now that you have your data available on the canvas, you're ready to start building your machine learning model with Assisted Modeling!

 

KaggleDataDownload.PNG

 

 

Building the Model

Start by dragging a Modeling tool from the Machine Learning tool tab onto the canvas and connecting it to the train.csv Input Data tool. Don't worry about setting the data types with a Select tool; Assisted Modeling will help you set them through its wizard.

 

After connecting the Assisted Modeling tool, run your workflow to send the training data to the tool. Then select the tool and click "Start Assisted Modeling." When you arrive at the first step, which asks for a target variable, choose "Survived" as your target; this is what we want to predict for the Kaggle competition. You'll notice that both Classification and Regression are available as Machine Learning Methods. You can choose either one; we will cover the extra steps each method requires later in this post. For the remaining steps, we will accept the defaults, though you can change any of Assisted Modeling's default recommendations if you are feeling bold.

 

 

Compare the Models

When you arrive at the leaderboard screen, you can pick the top-performing model and add it to the canvas or, if you're feeling adventurous, add all of the models and try each one. Kaggle allows 10 submissions a day, so you won't be penalized for trying different options. For this post, we will pick the top model and add it to the canvas.

 

 

Construct the Output Data

Now that we have a model added to the canvas and a Predict tool available, we're ready to generate our predictions and export them to a CSV to upload to Kaggle. After we've submitted our predictions we'll be able to compare our results against the best and worst submissions for the competition.

 

First, we need to construct our output data. Start by adding an Input Data tool and connecting it to the "test.csv" data source. Easy enough. However, if you connect the "test.csv" data directly to the Predict tool, you'll hit an obscure error when running the workflow: the "Fare" column in test.csv contains null values that the model cannot handle. There are two ways to resolve this issue; one is very easy, the other a bit more difficult.

 

Easy Way:

The easy way is to select the Transformation tool that cleans up null values in your Assisted Modeling pipeline, check the box for the "Fare" column, and set the dropdown to "Replace with Median." This is not done by default because "train.csv" contained no null values in the "Fare" column, so the checkbox was not set. When you are dealing with unseen data that may contain null values, it's best to check all of the columns and decide how the nulls in each should be handled.

 

Hard Way:

The hard way is more in line with the Designer paradigm. Select the "Fare" column with a Select tool, then use a Summarize tool to calculate the median fare. Finally, use a Formula tool to find the null values in the dataset and replace them with that median. An example of this can be found in the "TitanicSolvedClassification.yxzp" and "TitanicSolvedRegression.yxzp" files attached to this post.
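
For readers who think more easily in code, the median replacement can be sketched in pandas. This is a minimal illustration of the idea, not what the Designer tools run internally: fill_fare_with_median is a hypothetical helper, the rows are made up, and the median is taken from the same data being cleaned, which is an assumption about how you wire up the Summarize tool.

```python
import pandas as pd


def fill_fare_with_median(df: pd.DataFrame, column: str = "Fare") -> pd.DataFrame:
    """Return a copy of df with nulls in `column` replaced by that column's median."""
    out = df.copy()
    median = out[column].median()  # median() ignores nulls by default
    out[column] = out[column].fillna(median)
    return out


# Made-up rows standing in for test.csv, with one missing fare.
test_df = pd.DataFrame(
    {"PassengerId": [892, 893, 894, 895], "Fare": [7.0, None, 9.0, 8.0]}
)
cleaned = fill_fare_with_median(test_df)
print(cleaned)
```

The helper works on a copy, so the original data stays untouched if you want to compare before and after.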

 

Now we're ready to connect the cleaned data to the Predict tool. Before submitting predictions, though, we need to format the output for the Kaggle submission. Drag a Select tool onto the canvas and connect it to the output anchor of your Predict tool. Select only the "PassengerId" and "Survived_predicted" columns, and rename "Survived_predicted" to "Survived." If you used a regression algorithm for your model, you also need to add a Formula tool after the Select tool and use it to round your predictions to either 0 or 1. An example of this can be found in the "TitanicSolvedRegression.yxzp" file.
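
The Select and Formula steps above can be sketched in pandas as follows. Treat this as an illustration under stated assumptions: format_submission is a hypothetical helper, the prediction values are made up, and "rounding" a regression output is implemented here as a 0.5 threshold so that values slightly below 0 or above 1 still land on 0 or 1.

```python
import pandas as pd


def format_submission(predictions: pd.DataFrame, regression: bool = False) -> pd.DataFrame:
    """Keep PassengerId and the prediction, renamed to Survived, as 0/1 integers."""
    out = predictions[["PassengerId", "Survived_predicted"]].rename(
        columns={"Survived_predicted": "Survived"}
    )
    if regression:
        # Assumption: anything at or above 0.5 counts as "survived".
        out["Survived"] = (out["Survived"] >= 0.5).astype(int)
    else:
        out["Survived"] = out["Survived"].astype(int)
    return out


# Made-up regression output standing in for the Predict tool's results.
preds = pd.DataFrame(
    {
        "PassengerId": [892, 893, 894],
        "Pclass": [3, 3, 2],
        "Survived_predicted": [0.12, 0.73, 1.04],
    }
)
submission = format_submission(preds, regression=True)
csv_text = submission.to_csv(index=False)  # pass a file path instead to write to disk
print(csv_text)
```

The index=False argument matters here: without it pandas writes an extra index column, and the file no longer matches the two-column layout of gender_submission.csv.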

 

Last, you need to export your data to a CSV. Drag an Output Data tool to the canvas and connect it to your formatted data (either a Select tool or a Formula tool depending on whether you chose Classification or Regression) and select an output file location for the predictions CSV. Now run the workflow and you're ready to submit your predictions on Kaggle!

 

Regression Workflow

 

 

Submit the Predictions

Go to the Titanic Competition Submit Page to submit your predictions. Click the Up Arrow icon under Step 1 and select the predictions CSV written by your Output Data tool. You can also add an optional description. Including details of your workflow there helps you keep track of multiple submissions as you tweak things to improve your score. Finally, click the "Make Submission" button and you will see your score.

 

 

Titanic Classification Score

Titanic Regression Score

 

 

Both the default Classification and Regression models from Assisted Modeling yield the same score.
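
To my understanding, the score Kaggle reports for this competition is plain accuracy: the fraction of passengers whose outcome you predicted correctly. Kaggle holds the true labels for test.csv, so you can't reproduce the leaderboard number yourself, but a minimal sketch of the calculation is handy for checking a model against a held-out slice of train.csv. The accuracy function and the sample labels below are illustrative, not taken from the competition.

```python
def accuracy(actual, predicted):
    """Return the fraction of positions where the prediction matches the true label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("cannot score an empty set of predictions")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Made-up labels: three of the four predictions match.
actual_labels = [0, 1, 1, 0]
predicted_labels = [0, 1, 0, 0]
score = accuracy(actual_labels, predicted_labels)
print(score)
```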

 

Explore

Now comes the really fun part! Play around with your model: select different features, try generating new features from the data, or adjust any other part of your Assisted Modeling pipeline, and see if you can get a better score. There are no wrong answers here; you'll just find things that work and things that don't. If you get a better score, feel free to upload your workflow and discuss what you did to produce a more accurate model.
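
As one example of a generated feature, here is a pandas sketch that combines the dataset's SibSp (siblings and spouses aboard) and Parch (parents and children aboard) columns into a family size. This is just a common idea to experiment with, not something the post's workflows include and not a guaranteed improvement; FamilySize, IsAlone, and add_family_features are hypothetical names. In Designer, the equivalent would be a Formula tool placed before the Modeling tool.

```python
import pandas as pd


def add_family_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with FamilySize and IsAlone columns added."""
    out = df.copy()
    # The passenger plus any siblings/spouses and parents/children aboard.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)
    return out


# Made-up rows with the two source columns.
passengers = pd.DataFrame(
    {"PassengerId": [1, 2, 3], "SibSp": [1, 0, 3], "Parch": [0, 0, 2]}
)
with_features = add_family_features(passengers)
print(with_features)
```

Whatever you add to train.csv has to be added to test.csv in the same way, or the Predict tool will see columns it was not trained on.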

 

As an added incentive, in each subsequent Tackling Competitions post I'll give a shoutout to the person with the best score from the previous post.

Happy exploring!