MichaelF · 08-09-2018

Part 1 of this series covered feature engineering and part 2 dealt with missing data. In this third and final post, we'll predict which Titanic passengers would survive.

4. Prediction

So now that we’ve treated all our variables, let’s get into the actual prediction. Let’s bring in the Output from part 3 and split our data back into the original Train data and Test data, which is as easy as using a Filter Tool.
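For readers who think in code, here is a minimal sketch of what that Filter Tool step does. It assumes the Test rows are the ones where Survived is empty, which is how the Kaggle files combine but is not something the workflow itself confirms. The rows and the `split_train_test` helper are hypothetical.

```python
# Hypothetical combined rows; the real data has many more columns.
combined = [
    {"PassengerId": 1, "Survived": 0, "Pclass": 3},
    {"PassengerId": 2, "Survived": 1, "Pclass": 1},
    {"PassengerId": 892, "Survived": None, "Pclass": 3},
    {"PassengerId": 893, "Survived": None, "Pclass": 3},
]


def split_train_test(rows, target="Survived"):
    """Mimic a Filter Tool: rows with a known target are Train, the rest are Test.

    Assumption: Test rows carry no value in the target column.
    """
    train = [r for r in rows if r[target] is not None]
    test = [r for r in rows if r[target] is None]
    return train, test


train, test = split_train_test(combined)
print(len(train), "train rows,", len(test), "test rows")
```

In Alteryx terms, `train` is the True output of the Filter and `test` is the False output.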

4.2-3 Building the Model and Variable Importance

Since we used the Forest Model for our imputation, let’s keep things consistent and use one to build our model. We’ll attach our train data to the Tool, with Survived as our target and Pclass, Gender, Age, Sibsp, Parch, Fare, Embarked, Title, FsizeD, Child, and Mother as our predictors. We’ll keep everything else as default and drop a Browse on the Report Output so we can see our Results.

From the Report, there are two important plots that can tell us a good deal about our model. The first is the Model Error plot.

Model Error Plot.png

The Out of Bag error rate, which Forest models use as their validation error rate, is around 20%. That probably isn’t the best, but it’s what we’ll go with. Our error rate for Survival is higher than our error rate for Death (30% vs. 10%), which means we’re much better at predicting death. Hooray?
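To see how the per-class rates and the overall rate relate, here is a small sketch that derives them from a confusion matrix. The counts are made up to give 10% and 30% class errors. They are not the numbers behind the plot, and `error_rates` is a hypothetical helper.

```python
def error_rates(confusion):
    """Return (per-class error, overall error) from confusion[actual][predicted] counts."""
    per_class = {}
    total = 0
    wrong = 0
    for actual, row in confusion.items():
        n = sum(row.values())
        misses = n - row.get(actual, 0)  # everything not predicted as the true class
        per_class[actual] = misses / n
        total += n
        wrong += misses
    return per_class, wrong / total


# Hypothetical counts, chosen only for illustration.
confusion = {
    "died": {"died": 90, "survived": 10},
    "survived": {"died": 18, "survived": 42},
}

per_class, overall = error_rates(confusion)
print(per_class, overall)
```

The overall rate is a weighted average of the class rates, so it sits between them and leans toward the larger class. With more deaths than survivals in the train data, that is why an overall figure can look respectable while survivors are still misclassified fairly often.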

The other plot coming from the Report Output is the Variable Importance Plot.

Variable Importance Plot.png

This tells us which variables were the most important to the model. We can see that “Title” has the highest importance, so it’s a good thing we created it! I will say that I expected Pclass to be higher up, but I suppose having money doesn’t always save you from being iced. If you want to learn more about these two graphs and the theory behind them, I will once again plug the Random Forest article I shared earlier.

Forest Workflow.png

4.4 Prediction!

Now that we have our model, let’s plug in a Score Tool to make predictions for our test data using the Model object that comes out of Forest Model’s Object Output. We don’t need to customize any of the settings in the Score Tool, because we didn’t do anything crazy with our train data (like oversampling, transforming it, etc.). Our output from the Score Tool gives us two new columns, Score_0 and Score_1, which represent the probabilities of each outcome.

Outcome Probabilities.png

To get a predicted Yes/No value, let’s use a Formula Tool: any record whose Score_1 probability is greater than or equal to 0.5 is labeled as survived, and everything else is labeled as not survived.

Prediction Workflow.png
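The Formula Tool rule above is a one-line threshold. Here is a sketch of the same logic in Python, using made-up scores. The `label_from_score` helper and the sample rows are hypothetical, and only the Score_0/Score_1 column names come from the Score Tool output.

```python
def label_from_score(score_1, threshold=0.5):
    """Return 1 (survived) when the survival probability meets the threshold, else 0."""
    return 1 if score_1 >= threshold else 0


# Hypothetical Score Tool output for three test passengers.
scored = [
    {"PassengerId": 892, "Score_0": 0.82, "Score_1": 0.18},
    {"PassengerId": 893, "Score_0": 0.50, "Score_1": 0.50},
    {"PassengerId": 894, "Score_0": 0.09, "Score_1": 0.91},
]

for row in scored:
    row["Survived"] = label_from_score(row["Score_1"])

print([(r["PassengerId"], r["Survived"]) for r in scored])
```

Note that a score of exactly 0.5 counts as survived, matching the “greater than or equal to” rule.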

4.5 Submission

If you’ve been following the R blog, this extends a little past that. To see how accurate our model is, we’ll have to make the submission file in the format that Kaggle accepts, which is a .csv with two fields – PassengerID and Survived. To do this we’ll just use a Select and an Output Tool.
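For comparison, the Select plus Output Tool step boils down to keeping two columns and writing a header row. The sketch below writes to an in-memory buffer so it runs anywhere. The rows are hypothetical, and the header spelling `PassengerId` is an assumption based on Kaggle’s sample file, so match whatever your download uses.

```python
import csv
import io


def write_submission(rows, handle):
    """Write only the two submission columns, dropping everything else."""
    writer = csv.DictWriter(
        handle,
        fieldnames=["PassengerId", "Survived"],  # assumed header spelling
        extrasaction="ignore",  # silently skip extra fields such as Score_1
    )
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical predictions that still carry a score column.
predictions = [
    {"PassengerId": 892, "Survived": 0, "Score_1": 0.18},
    {"PassengerId": 893, "Survived": 1, "Score_1": 0.50},
]

buffer = io.StringIO()
write_submission(predictions, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To produce an actual file, pass `open("submission.csv", "w", newline="")` instead of the buffer.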

Now that we have our submission.csv, let’s submit to Kaggle and get our results! Just go to https://www.kaggle.com/c/titanic, click the “Submit Predictions” button, and drag our submission.csv in. Kaggle will automatically calculate your score and put it on the leaderboard.

Woah! 78% seems pretty good for just a simple model! I’m sure we could get this better, but this is a great starting point. If you’re looking at the leaderboard, I’ve read that 80-84% is really good, and anything north of that might be considered cheating (considering the names of those who survived are publicly available). So 78% seems mighty fine, and there’s definitely room for improvement.

Note: For those that went the R Tool route (the ones that used the MICE package for our predictive imputation), let’s pop that into Kaggle and see how that one did.

It looks like it performed the same, which makes sense, considering the differences in the histograms were very small.

5. Conclusion

So that kind of wraps up our run at the Titanic Kaggle data. We can change a lot of things to affect our Score and get into a wonderful rabbit hole of stats and prediction by changing variables, models, and configurations. We could change which variables to use, change up the model away from a Random Forest, or even simply change our probability threshold for what constitutes “Survived” (remember we used 0.5). I would love to hear any changes you guys made and how they improved or lowered the Accuracy.
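As a rough illustration of the threshold idea, the sketch below scores a handful of made-up probabilities at a few cutoffs. Kaggle hides the test labels, so in practice you would need a validation slice held out of the train data. The scores, labels, and `accuracy_at` helper here are all hypothetical.

```python
def accuracy_at(threshold, scores, actuals):
    """Share of records labeled correctly when scores >= threshold count as survived."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    hits = sum(p == a for p, a in zip(predicted, actuals))
    return hits / len(actuals)


# Hypothetical validation scores (Score_1) and true outcomes.
scores = [0.10, 0.35, 0.45, 0.55, 0.62, 0.80]
actuals = [0, 0, 1, 0, 1, 1]

results = {t: accuracy_at(t, scores, actuals) for t in (0.4, 0.5, 0.6)}
for threshold, accuracy in results.items():
    print(f"threshold {threshold}: accuracy {accuracy:.2f}")
```

On toy data like this, 0.5 is not automatically the best cutoff, which is the point: the threshold is one more knob worth checking before resubmitting.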

If you decide to get into the custom R code route, definitely check out the alternative part 2 workflow that uses it. For even more customization, if there’s a certain model that Alteryx doesn’t specifically use, you can always replace the model in the 3rd workflow with an R Tool and specify your own model! And as always, I would love to hear anything you guys tried to improve the Grade.

As a closing note on the aspect of using Alteryx over R, it was an absolute blast. There were a couple sections I cut out because Alteryx has all that functionality built-in or has it within one tool. I originally thought creating the tables might have been a bit tricky to work with, but once I nailed down the Summarize/Crosstab functionalities, it was a breeze.

As a conclusion’s conclusion, I appreciate anyone who made it this far down the read and joined me in this explorative project. Happy Alteryx-ing folks!

-Mike