MichaelF
Alteryx Alumni (Retired)

titanic-b.png


Part 1 of this series covered the feature engineering portion of predicting which Titanic passengers would survive. In this second post, we'll implement imputation techniques to deal with missing data.


3. Missingness


Now that we’ve created some new variables, let’s examine the existing ones to see where we’ll need to impute missing values. We could simply discard any records that contain nulls, since they’re incomplete, but because our dataset is relatively small, we can’t afford to throw away rows with missing data, or entire columns for that matter.


Let’s start by bringing in our Output from part 1 and looking at the current variables. In the Results window you’ll see that most fields have a green line over them, while some show a little yellow. The yellow indicates the presence of nulls; if you hover over a specific field, you can see the percentage of nulls versus the percentage that is OK. Additionally, we can throw in a Field Summary Tool to easily see which fields have nulls. Aside from “Survived”, we see that “Age”, “Cabin”, “Embarked”, and “Fare” have missing values. “Cabin” is about 70% missing, so there isn’t much we can do with it, at least for this blog, and we’ll leave it out.
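If you want to reproduce this check outside of Alteryx, here’s a minimal pandas sketch that reports the share of nulls per column, much like the Field Summary Tool does (the file name is hypothetical; point it at your combined Titanic data):

```python
import pandas as pd

# Load the combined Titanic data (hypothetical path; adjust to your setup).
df = pd.read_csv("titanic_combined.csv")

# Fraction of missing values per column, highest first --
# the same picture the Field Summary Tool gives us.
null_share = df.isnull().mean().sort_values(ascending=False)
print(null_share[null_share > 0])
```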


3.1 Sensible Value Imputation


Since Embarked and Fare have only a handful of nulls, let’s fix them first. Focusing on Embarked, we can use a Filter Tool to find the two entries that are null for this field. Looking at their other variables, we can try to infer their embarkation port from Pclass and Fare: both passengers had a Pclass of 1 and a Fare of 80. We can then use a Summarize Tool and a Filter on the non-null data to see which embarkation port makes the most sense.

Port.png

I decided to look at the average and median fares for 1st class at each port to see where our 80-dollar-fare passengers would most sensibly belong. We can rule out port Q: with only 3 observations, it’s unlikely there were fares much different from the calculated average of 90. We can also rule out port S, because our passengers would then sit at the high end of that distribution, which is relatively improbable. For port C, however, an 80-dollar fare falls between the average and the median, so there’s a higher probability they embarked from this port. Using a Formula Tool, we can simply assign “C” to those two passengers and Union them back into the original dataset.
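In pandas, the same comparison and fix might look like this (a sketch, using the Kaggle column names):

```python
# Average and median 1st-class fares by port of embarkation.
first_class = df[(df["Pclass"] == 1) & df["Embarked"].notnull()]
print(first_class.groupby("Embarked")["Fare"].agg(["mean", "median", "count"]))

# An 80-dollar fare sits between the mean and median for port C,
# so assign "C" to the two passengers missing a port.
df.loc[df["Embarked"].isnull(), "Embarked"] = "C"
```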


The next variable to check is Fare. Using a Filter, we see that only one passenger has missing data. In a similar fashion to how we dealt with Embarked, let’s Filter the non-null data down to this passenger’s Pclass (3) and embarkation port (S). If we add a Histogram Tool, we can see the distribution of comparable fares and where our passenger would most likely fall.

Fare.png

It looks like the tallest bar will be just under a 10-dollar fare. Let’s throw in a Summarize Tool and see whether the average or the median of these values makes the most sense (and we’ll throw in the Mode because why not).

Fare Stats.png

There we go! 8.05 fits nicely in our graph, so let’s use a Formula Tool to set our passenger’s Fare to 8.05 and Union everything back together.
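Here’s the equivalent logic as a pandas sketch (continuing from the DataFrame above):

```python
# Restrict to comparable passengers: 3rd class, embarked at Southampton.
similar = df[(df["Pclass"] == 3) & (df["Embarked"] == "S") & df["Fare"].notnull()]
print(similar["Fare"].agg(["mean", "median"]))
print(similar["Fare"].mode().iloc[0])

# The median (8.05) lands in the tallest bin of the histogram,
# so use it for the single passenger with a missing fare.
df.loc[df["Fare"].isnull(), "Fare"] = similar["Fare"].median()
```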

SVI Workflow.png

3.2 Predictive Imputation


Now all that’s left is to complete our Age variable. It has around 20% nulls, which is too many for simple imputation, so we’ll do some predictive analysis to fill them in!


To do this, we’ll throw in a Filter Tool to separate the data with null Ages from the data without: our very own train and test datasets! We’ll pop in a Forest Tool and use Age as our target variable. Our predictors are going to be Pclass, Gender, SibSp, Parch, Fare, Embarked, Title, and FsizeD (our discretized family-size variable). If you’d like to learn more about the Forest Tool and random forests in general, I advise you to check out Seeing the Forest for the Trees: An Introduction to Random Forest, which does a great job of explaining all the small details of random forests.


Now that we’ve created the model, let’s pop in a Score Tool and predict some Ages! I also decided to use a Select Tool to change the data type of the Score field, since that’s the easiest way to round these values. To check that our predictions follow the same distribution as the non-null data, we’ll compare the two with Histograms.
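If you want to mirror the Forest Tool outside of Alteryx, a scikit-learn random forest is a reasonable stand-in. A sketch, assuming the column names from our workflow and one-hot encoding for the categorical predictors:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

predictors = ["Pclass", "Gender", "SibSp", "Parch",
              "Fare", "Embarked", "Title", "FsizeD"]

# One-hot encode the categorical predictors so the forest can use them.
X = pd.get_dummies(df[predictors])
known = df["Age"].notnull()

# Train on the passengers whose Age we know...
forest = RandomForestRegressor(n_estimators=500, random_state=42)
forest.fit(X[known], df.loc[known, "Age"])

# ...then score the rest and round, as the Select Tool's
# data-type change does in the workflow.
df.loc[~known, "Age"] = forest.predict(X[~known]).round()
```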


Our Predicted Ages

Our Original Ages

It looks like the first two bins around the 20-year mark are a bit flipped, but everything else is pretty similar, and the differences are small. If we really wanted to make this more accurate, we could try different models and different configurations, but we’ll just continue from here.
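To reproduce the side-by-side check in Python, a quick matplotlib sketch (reusing the `known` mask from the forest sketch above):

```python
import matplotlib.pyplot as plt

# Compare the distribution of predicted ages with the originals.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
axes[0].hist(df.loc[~known, "Age"], bins=20)
axes[0].set_title("Predicted Ages")
axes[1].hist(df.loc[known, "Age"], bins=20)
axes[1].set_title("Original Ages")
plt.show()
```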


Note: The R blog uses the MICE R package, which does get the histograms to match more closely. I wanted to get away from R-specific code, since these results suffice for our analysis and the whole point of this project was to not code in R, but feel free to check out the alternative workflow that uses the R Tool and the code associated with it. The MICE package is pretty cool and provides a method for imputing multiple variables at once, but since this was just one variable, I figured we could use a standard random forest. To read more about MICE, I recommend this article.
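If you’d like a MICE-style, chained-equations approach without leaving Python, scikit-learn’s IterativeImputer implements a similar idea. A minimal sketch (the column list is illustrative and numeric-only for simplicity):

```python
# IterativeImputer is still experimental, so it needs an explicit enable import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Impute each column from the others in round-robin fashion.
num_cols = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
imputer = IterativeImputer(random_state=42)
df[num_cols] = imputer.fit_transform(df[num_cols])
```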


3.3 Feature Engineering: Round 2


Now that we’ve finished our imputations, let’s see what other variables we can create. Since we’ll be comparing against Survival, let’s set aside the test data, which has nulls for Survived. We know the common phrase for lifeboat priority was “women and children first,” so let’s create a Mother variable and a Child variable. Before we do any of this, though, let’s look at the relationship between Age and Survival, separated by Gender. We’ll create a tag for Survived and Not Survived and Filter on Gender. Then, with Summarize Tools and Joins, we’ll create two graphs: one looking at women, and one looking at men.


Survival of Women

Survival of Men

We can see that most passengers are in their mid-twenties, and that proportionally more women survived. It’s hard to judge from these graphs how children fared, so let’s create the “Child” variable to see how significant being a child was. In the Formula Tool, we’ll say a Child is anyone under 18 years of age and anyone older is an Adult. Then we plug in a Summarize to group and count everyone, and a Crosstab to get a more readable table.
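In pandas, the flag and the table might look like this (on the train-only data, since the test rows have no Survived value):

```python
import numpy as np
import pandas as pd

# Anyone under 18 is a Child; everyone else is an Adult.
df["Child"] = np.where(df["Age"] < 18, "Child", "Adult")

# Group, count, and pivot -- the pandas analogue of Summarize + Crosstab.
print(pd.crosstab(df["Child"], df["Survived"]))
```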

Child.png

As we can see, it looks like being a child increases your survival chances, but not by a whole lot. Now let’s see whether being a Mother increases your chances of survival. In the same Formula Tool, we’ll say a Mother is anyone who is female, has a Parch value greater than 0, is older than 18, and whose title is not “Miss”. Now let’s do the same Summarize and Crosstab and see what we have.
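A sketch of the same rule in pandas (again assuming the workflow’s column names, including the Title field from part 1):

```python
import numpy as np
import pandas as pd

# A Mother: female, traveling with a child or parent (Parch > 0),
# older than 18, and not titled "Miss".
is_mother = (
    (df["Gender"] == "female")
    & (df["Parch"] > 0)
    & (df["Age"] > 18)
    & (df["Title"] != "Miss")
)
df["Mother"] = np.where(is_mother, "Mother", "Not Mother")
print(pd.crosstab(df["Mother"], df["Survived"]))
```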

Mother.png

Nice! It looks like there’s a pretty good chance that being a Mother will increase your survival. Now that we have all our variables and have replaced our nulls, let’s throw in a Field Summary to make sure everything is in order. Make sure all your variables have the correct data type (Continuous and Count fields should be numeric; Categorical and Discrete fields should be strings). We can see that only Cabin still has nulls, but since we’re not using that field, we should be all good to start predicting survival!
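The same sanity check in pandas is a couple of one-liners (on the running DataFrame from the sketches above):

```python
# Verify data types, and confirm Cabin is the only field still holding nulls.
print(df.dtypes)
print(df.isnull().sum().loc[lambda s: s > 0])
```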

Feature Engineering Round 2.png

Note: Now that we’re all good, let’s recreate the Child and Mother variables on the full dataset (we built them on the train rows only, after setting the test data aside), sort everything back together, and then create an Output that we can use for part 3.


Titanic Missingness Output.png

Next week we'll close out the series with the Prediction section.
