
Data Science

Machine learning & data science for beginners and experts alike.
martinding

Survival Analysis, Part 1: Introduction (You are here)

Survival Analysis, Part 2: Key Models

Survival Analysis, Part 3: Using Alteryx

Survival Analysis, Part 4: Using Python in Alteryx

Welcome to this four-part blog series, where we introduce a powerful analytical tool called Survival Analysis. In this series, I will provide a beginner-friendly guide to help you understand this popular statistical method.

 

In the first part, I will introduce the key concepts of survival analysis and show you some use cases where it can be applied. In the second part, we will dive deeper into the key models in survival analysis. In the third part of the series, we will walk you through how you can perform survival analysis using Alteryx. Finally, we will see how we can perform survival analysis using Python.

 

Whether you are a marketing analyst, medical researcher, engineer, or social scientist, this series will help you understand how to analyze time-to-event data and predict survivability. So, let’s dive in!

 

What is Survival Analysis?

 

Survival analysis is the statistical method that answers the all-important question: “How long until it happens?” It was originally developed in the medical field to predict the time until patient death (hence the name survival analysis). Nowadays, survival analysis is also widely used in engineering, social sciences, and marketing analytics, either to predict what percentage of a group has experienced a specific event as time passes or to compare the time to an event across different groups.

 

I first encountered survival analysis when analyzing customer churn data, so why don’t we use churn to help us understand the topic?

 

In the case of customer churn, a survival analysis for a children’s clothing boutique can be visually represented in a graph like the one below. The x-axis of the graph represents the time elapsed since the customer's first visit to the store, while the y-axis shows the percentage of returning customers who continue to shop at the store (known as the retention rate). Each time point on the x-axis shows the percentage of the original customer population still active at the store.

 

Survival analysis also allows the comparison of multiple groups in the same chart, with each group being represented by its own line. Here, you can see the percentage of customers who continued shopping at the store among the groups who received weekly coupons and those who did not.

 

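[Figure: customer retention curves over time since first visit, with one line for customers who received weekly coupons and one for those who did not]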
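
To make the y-axis of such a chart concrete, here is a minimal Python sketch of how a retention curve could be computed for two groups. The churn weeks and the `retention_curve` helper are made up for illustration, and the sketch assumes every customer was observed until they stopped shopping (no censorship, a concept covered later in this article).

```python
def retention_curve(churn_weeks, horizon):
    """Fraction of the original customers still active after each week 0..horizon."""
    total = len(churn_weeks)
    return [
        sum(1 for week in churn_weeks if week > t) / total
        for t in range(horizon + 1)
    ]


# Hypothetical data: the week in which each customer stopped shopping.
coupon_group = [3, 6, 8, 8, 10, 12, 12, 12]
no_coupon_group = [1, 2, 2, 4, 5, 6, 9, 12]

# One row per week, one column per group (one line per group on the chart).
for week, (with_coupon, without_coupon) in enumerate(
    zip(retention_curve(coupon_group, 12), retention_curve(no_coupon_group, 12))
):
    print(f"week {week:2d}: coupons {with_coupon:.0%}, no coupons {without_coupon:.0%}")
```

Each printed row corresponds to one time point on the x-axis, and each column to one line on the chart.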

 

Why do we use Survival Analysis?

 

“Why do we even need survival analysis when we have machine learning?”

 

“Can’t we use classification models to predict churn?”

 

Yes, I hear you, and I agree: when it comes to predicting customer churn, it’s easy to get caught up in the hype of modern machine learning tools. And, let’s face it, who can resist the temptation of high accuracy rates? However, there is at least one area where machine learning-based classification models fall short, and that’s predicting when churn will occur. This is where survival analysis truly shines, and knowing the “when” is really valuable to businesses:

  • Understanding when churn is likely to occur helps businesses prioritize and target customers. For instance, by distinguishing customers who are likely to churn after only one week of use from those who are likely to churn after five years of tenure, the marketing team can develop retention strategies tailored to each group.
  • A customer’s value is often related to how long they stay with a business. For subscription-based businesses such as Netflix, a customer who churns in 1 month is not the same as a customer who churns in 1 year in terms of Customer Lifetime Value (CLV).
  • Survival analysis allows us to deal with censorship (more on this later). If we have not observed a customer churning so far, it does not imply that the customer never will. Classification analysis often neglects this, and the ability to deal with ‘censorship’ in data makes survival analysis better suited than traditional classification techniques for this type of scenario.
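
Here is a small Python sketch of the first two points: a yes/no churn label treats two customers identically even though they stayed for very different lengths of time. The `Customer` class and the monthly fee are hypothetical, and revenue to date is used only as a crude stand-in for Customer Lifetime Value.

```python
from dataclasses import dataclass

MONTHLY_FEE = 10.0  # hypothetical subscription price


@dataclass
class Customer:
    months_active: int  # time from sign-up to churn
    churned: bool       # the label a classification model would predict


def revenue_to_date(customer):
    """Crude stand-in for CLV: months subscribed times the monthly fee."""
    return customer.months_active * MONTHLY_FEE


early = Customer(months_active=1, churned=True)
late = Customer(months_active=12, churned=True)

# Same classification label, very different value to the business.
print(early.churned == late.churned)                  # True
print(revenue_to_date(early), revenue_to_date(late))  # 10.0 120.0
```

The duration is the piece of information that a plain churned/not-churned label throws away.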

 

Key Concept: Censorship

 

Censorship, in the context of survival analysis (more commonly called censoring in the statistics literature), refers to losing track of an instance (in our case, a customer) during an observation period, or to the event (churn) not having been observed for a customer during this period. This is an important concept because if we don’t account for censorship, we risk biasing our predictions: just because we haven’t observed a customer canceling a subscription doesn’t mean they never will. More specifically, there are three types of censorship:

  1. Right-censored data: When we know when a customer started the subscription but don’t know when churn occurred (event end time). This happens either because:
    • the customer record was withdrawn for reasons other than churn (e.g., a data entry issue), or
    • the customer simply hadn’t churned when we conducted the analysis.
  2. Left-censored data: When the customer churn time (end time) is known, but we don’t know when they started the subscription:
    • This may happen if a customer started the subscription before our observation period (e.g. when some customer data is in an older database and hasn’t been migrated to the current database used for analysis).
  3. Interval-censored data: When the relevant data is collected only at specific time intervals, so the exact start and end times are not known, only the interval in which they fall.
    • For example, when we need daily data for churn analysis, but some customers’ info has been truncated to monthly granularity.
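
One common way to encode right-censored data is as a duration plus a flag saying whether the event was actually observed. The Python sketch below does that and then applies the Kaplan-Meier estimator, a standard survival-curve estimator that this article does not cover, to show how censored customers still count as "at risk" up to the point where we stop observing them. The function names and sample dates are assumptions made for illustration.

```python
from datetime import date


def to_observation(start, churn, analysis_date):
    """Return (duration_in_days, event_observed) for one customer.

    churn is None when the customer had not churned by analysis_date,
    which makes the observation right-censored.
    """
    end = churn if churn is not None else analysis_date
    return (end - start).days, churn is not None


def kaplan_meier(observations):
    """Kaplan-Meier survival estimate as a list of (time, survival) steps."""
    survival = 1.0
    curve = []
    event_times = sorted({t for t, observed in observations if observed})
    for t in event_times:
        # Everyone whose duration reaches t is still at risk, censored or not.
        at_risk = sum(1 for d, _ in observations if d >= t)
        events = sum(1 for d, observed in observations if d == t and observed)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical customers: (start date, churn date or None if still active).
analysis_date = date(2023, 5, 1)
customers = [
    (date(2023, 1, 1), date(2023, 1, 31)),
    (date(2023, 1, 1), date(2023, 3, 2)),
    (date(2023, 2, 1), None),  # right-censored: still subscribed
    (date(2023, 4, 1), None),  # right-censored: still subscribed
]
observations = [to_observation(s, c, analysis_date) for s, c in customers]
for days, survival in kaplan_meier(observations):
    print(f"day {days}: estimated retention {survival:.1%}")
```

Treating the censored customers as if they had churned on the analysis date would discard the information that they stayed at least that long, and would make retention look worse than it is.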

 

Use Cases

 

Survival analysis lets us analyze time-to-event data on a wide range of topics. We can apply it to predict virtually any event of interest that happens over time, as long as we can define a clear start and end. Some of the common use cases include:

  1. Medical Research: Survival analysis is frequently used in medical research to study the time to onset of disease, death, or even time to hospital discharge. For example, a researcher might use survival analysis to study the survival time of patients with cancer after treatment or to study the time to progression of a disease.
  2. Engineering: Survival analysis is used in engineering to help predict maintenance and time to failure. For example, a researcher might use survival analysis to study the time to failure of a mechanical component or the time to failure of a bridge.
  3. Finance: Using survival analysis, we can predict a borrower’s time to default. For example, a lender might use survival analysis to study how the probability of default in a loan portfolio changes over time.
  4. Social Sciences: Survival analysis is used in social sciences to study the time to event for a range of outcomes. For example, a researcher might use survival analysis to study the time to first marriage or the time to unemployment for a group of people.
  5. Marketing: Finally, survival analysis is used in marketing to analyze customer retention rates and churn. For example, a company might use survival analysis to study the time to churn of its customer base, to determine what factors influence churn, and to develop strategies to reduce churn rates. We will see this in action in the later parts of this series.

 

Stay tuned for the next articles in this series--you can subscribe to the data science blog to make sure you don't miss any!
