Featured on the Data Science Portal.
In this series, we'll look at different machine learning and data science competitions, see how the new Assisted Modeling tools for Alteryx perform in them, and show how easy these tools make it to build fast, effective models for predictive analysis.
We'll start with one of the most basic and well-known Kaggle competitions, often used in data science and machine learning courses: Titanic: Machine Learning from Disaster. This competition challenges competitors to predict whether a given person survived the historic sinking of the Titanic, based on metadata about the individual. We'll walk through the most straightforward approach to solving this problem using the new Assisted Modeling tools and compare the results of our trained models against other competitors.
To follow along and get the most out of this post it is recommended that you have Alteryx Designer installed, and the properly licensed Alteryx Intelligence Suite tools installed alongside Designer. You can download and install Alteryx Designer from the Download Page.
First off, we'll need the Titanic dataset to train our machine learning model. You can download the dataset from the Kaggle website. The data is split into three parts:

- train.csv — the labeled data used to train the model
- test.csv — the unlabeled data you'll generate predictions for
- gender_submission.csv — an example of the expected submission format
After downloading the data, open Alteryx Designer and import these datasets into a new workflow using the Input Data tool. Use a Browse tool to inspect the contents of the data and get a feel for what you have to work with for this challenge. Once you understand the contents of the CSV data, you're ready to move on.
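If you want to sanity-check the data outside Designer, a quick pandas look gives roughly the same view as the Browse tool. The rows below are made up for illustration (in practice you would call `pd.read_csv("train.csv")` on the downloaded file); the column names match Kaggle's train.csv.

```python
import io
import pandas as pd

# Illustrative stand-in for Kaggle's train.csv -- the values here are
# invented; in practice, load the real file with pd.read_csv("train.csv").
sample = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,Fare\n"
    "1,0,3,male,22,7.25\n"
    "2,1,1,female,38,71.28\n"
    "3,1,3,female,,7.92\n"
)
train = pd.read_csv(sample)

# Rough equivalent of a Browse tool inspection
print(train.shape)           # rows x columns
print(train.dtypes)          # inferred datatypes
print(train.isnull().sum())  # null counts per column
```

The null counts are worth a glance even at this stage: they foreshadow the missing-value handling we'll need later for the test data.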
Alternatively, I've provided a .yxzp file that you can use as a starting place. This file is structured similarly to the Weekly Challenges, so it's pretty intuitive to follow, though fairly bare.
Now that you have your data available on the canvas you're ready to start building your machine learning model with Assisted Modeling!
Start by dragging a Modeling tool down from the Machine Learning tool tab and connecting it to the train.csv Input Data tool. Don't worry about setting the datatypes with a Select tool; Assisted Modeling will help you set them through the assisted wizard.
After connecting the Modeling tool, run your workflow to push the training data into it. Then select the Modeling tool and click "Start Assisted Modeling." When you arrive at the first step, choose "Survived" as your target variable; this is what we want to predict for the Kaggle competition. You'll notice that both Classification and Regression are available as machine learning methods, and you can choose either one; we'll discuss the particularities each method requires later in this post. For the rest of the steps, we'll select the defaults along the way, though if you're feeling particularly bold, you can change any of Assisted Modeling's default recommendations.
When you arrive at the leaderboard screen, you can pick the top-performing model and add it to the canvas, or if you're feeling adventurous, you can add all of the models and try them all out. Kaggle allows 10 submissions a day so you won't be penalized for trying different options. For this post, we will pick the top model and add it to the canvas.
Now that we have a model added to the canvas and a Predict tool available, we're ready to generate our predictions and export them to a CSV to upload to Kaggle. After we've submitted our predictions, we'll be able to compare our results against the best and worst submissions for the competition.
First, we need to construct our output data. Start by adding an Input Data tool and connecting it to the "test.csv" data source. Easy enough. However, if you connect the "test.csv" data directly to the Predict tool, you'll hit an obscure error when running the workflow: the "Fare" column in test.csv contains null values that the model cannot handle. There are two ways to resolve this issue; one is very easy, the other a bit more involved.
The easy way is to select the Transformation tool that cleans up null values in your Assisted Modeling pipeline, check the box for the "Fare" column, and set the dropdown to "Replace with Median." This isn't done by default because "train.csv" contained no null values in the "Fare" column, so the checkbox was never set. When you're dealing with unknown data that may contain null values, it's often best to check all of the columns and decide how those nulls should be properly handled.
The hard way is more in line with the Designer paradigm: select the "Fare" column with a Select tool, use a Summarize tool to calculate the median fare, and then use a Formula tool to find the null values in the dataset and replace them with the median. An example of this can be found in the "TitanicSolvedClassification.yxzp" and "TitanicSolvedRegression.yxzp" files attached to this post.
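The same median-fill logic, sketched in pandas (the "Fare" column name comes from the dataset; the rows and values below are illustrative stand-ins for test.csv):

```python
import io
import pandas as pd

# Illustrative stand-in for test.csv, including the null Fare value
# that trips up the model.
test = pd.read_csv(io.StringIO(
    "PassengerId,Pclass,Sex,Fare\n"
    "892,3,male,7.83\n"
    "893,3,female,\n"
    "894,2,male,9.69\n"
))

# The Summarize step: compute the median of the non-null fares
median_fare = test["Fare"].median()

# The Formula step: replace nulls with that median
test["Fare"] = test["Fare"].fillna(median_fare)
```

Either route, in Designer or in code, leaves the "Fare" column free of nulls so the model can score every row.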
Now we're ready to connect the cleaned data to the Predict tool. But before we start making predictions, we need to format the data properly for the Kaggle submission. Drag out a Select tool and connect it to the output anchor of your Predict tool. Select only the "PassengerId" and "Survived_predicted" columns, and rename "Survived_predicted" to "Survived." If you used a regression algorithm for your model, you also need to add a Formula tool after the Select tool and use it to round your predictions to either 0 or 1. An example of this can be found in the "TitanicSolvedRegression.yxzp" file.
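As a sketch of what that Select-and-round step produces (the prediction scores and PassengerId values below are invented; "Survived_predicted" is the column name the Predict tool emits):

```python
import pandas as pd

# Illustrative regression-style output from the Predict tool:
# continuous scores rather than 0/1 labels.
preds = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived_predicted": [0.12, 0.87, 0.49],
})

# The Select + Formula steps: keep only the two required columns,
# round the scores to 0/1, and rename for Kaggle's submission format.
submission = pd.DataFrame({
    "PassengerId": preds["PassengerId"],
    "Survived": preds["Survived_predicted"].round().astype(int),
})
```

From here, `submission.to_csv("predictions.csv", index=False)` would produce the same two-column file the Output Data tool writes in the next step.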
Last, you need to export your data to a CSV. Drag an Output Data tool onto the canvas, connect it to your formatted data (either the Select tool or the Formula tool, depending on whether you chose Classification or Regression), and select an output file location for the predictions CSV. Now run the workflow and you're ready to submit your predictions on Kaggle!
Go to the Titanic Competition Submit Page to submit your predictions. Click the Up Arrow icon under Step 1 and select the predictions CSV from your Output Data tool. You can add an optional description if you like; it may help to note aspects of your workflow in the description to keep track of multiple submissions with various tweaks to enhance your score. Finally, click the "Make Submission" button and you'll see your score.
Both the default classification and regression models from Assisted Modeling yield the same score.
Now comes the really fun part! Play around with your model, select different features, try to generate your own new features from the data, or tweak any other aspects of your Assisted Modeling pipeline and see if you can get a better score! There are no wrong answers here; you'll just find things that work and things that don't. If you land on a better score, feel free to upload your workflow and discuss what you did to produce a more accurate model.
I'll give a shoutout in the subsequent Tackling Competitions Post to the person with the best score in the previous post as an added incentive.