Applied Data Science for the Churn Problem: Building a Predictive Machine Learning Model for the Consumer Business Industry

Weiming Chen
20 min read · Dec 30, 2021

Intro

In the consumer product/service industry, churn is a crucial business problem. Churn is a general term for customers canceling a service or switching to a competitor, for example in telecommunication phone plans, gaming platforms, and entertainment streaming platforms. However, every business defines churn differently according to the nature of its business model. Some may define churn only for paid subscribers, based on MRR (monthly recurring revenue). On the other hand, businesses offering a completely free service may define churn on an activity basis, observing each user's recency and flagging those who stop using the service within an allowable time gap (e.g., Twitter, Instagram).

In addition, for companies that offer a freemium business model (free tier + paid tier), where users have the option to cancel their membership or close their account, churn may be defined on an overall cancellation basis.

Project Description

Project Overview

Sparkify is a simulated music streaming service, similar to Spotify. In this post, I will work through churn problem solving based on Sparkify's event log dataset. The goal is to measure churn, design metrics for forecasting churn, and build a predictive machine learning model.

Project Problem

Churn is often an overlooked issue in the consumer industry. By convention, stakeholders and Customer Success teams emphasize acquiring new users, allocating a large amount of resources and energy to competing for new clients. However, the unit cost of acquiring a new customer is considerably higher than that of keeping an existing one. Furthermore, once users switch to a competitor, it is even harder to win them back. Data scientists therefore spend effort forecasting users' churn probability within a lead time, extracting patterns among churned users, and even uncovering latent factors associated with the product or pricing model.

For the Sparkify project, the problem is measuring the overall churn rate, designing analytic metrics, and building a classification model to predict churn.

About This Post

The post will focus more on applied data science for churn, the design process, and the methodology. The full, detailed implementation lives in my end-to-end source code for this project, hosted here:

Link on my GitHub.

https://github.com/gilzero/project_churn_spark

Technologies Used for this project

  • Python 3.9
  • Jupyter
  • Pandas
  • Numpy
  • Plotly
  • Matplotlib
  • Scikit-learn
  • XGBoost
  • Apache Spark (Databricks) for the full set

Data Exploration

The project comes with a mini-size and a full-size event dataset. For this post, I worked on the mini set locally. It contains 286,500 event logs for 225 distinct users.

Raw data attributes:

  • ts: timestamp of the event log
  • userId: account ID
  • sessionId: browsing session ID
  • page: the page the user encountered for the event
  • auth: the user's authentication status for the event
  • method: HTTP request method
  • status: HTTP status code
  • level: the user's tier, free or paid
  • itemInSession: cumulative count of items in the session
  • location: user's location (U.S. based)
  • userAgent: browsing hardware / OS
  • lastName
  • firstName
  • registration: timestamp of the user's registration date/time
  • artist: artist name of the song played
  • song: song name
  • length: playback duration of a song play event

Example of data instances:

Checking the max and min timestamps of the events shows that the dataset covers roughly a two-month slice of event logs.

There are some instances of missing userId, indicating anonymous users. I will trim these instances.
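
As a minimal sketch (the file name is an assumption based on the Udacity mini dataset; adjust it to your copy), loading the mini event log and trimming anonymous events could look like this:

import pandas as pd

# Load the mini event log (file name assumed; one JSON record per line)
df = pd.read_json('mini_sparkify_event_data.json', lines=True)

# Trim anonymous events: in this log a missing userId shows up as NaN or an empty string
df = df[df['userId'].notna() & (df['userId'].astype(str).str.strip() != '')]

print(len(df), df['userId'].nunique())  # event count and distinct users after trimming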

Consumer products usually follow the weekly cycle of human behavior. So, out of curiosity, let's start by checking on which days of the week users use the service to stream music.

I initially thought users would tend to use streaming entertainment more on the weekend than on weekdays, because people might be busy working or studying during the week. However, the dataset reveals that people listen more on weekdays. It is a reminder that a quick visualization can help you check any preconception or misconception.
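
A quick sketch of that day-of-week check, assuming ts is a millisecond epoch timestamp as it appears to be in this log:

# Count song plays by day of week
plays = df[df['page'] == 'NextSong'].copy()
plays['day_of_week'] = pd.to_datetime(plays['ts'], unit='ms').dt.day_name()

order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
print(plays['day_of_week'].value_counts().reindex(order))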

One of the most crucial attributes in the raw log is ‘page’.

Within the values of the 'page' attribute, the dominant event is 'NextSong', which fires when the user plays the next song and therefore serves as a song play count.

There is no description of the attributes and values for the dataset, but I would infer the events as follows:

  • NextSong: indicates a song play event
  • Home: home page visited
  • Thumbs Up: a 'Like' button hit for a song
  • Add to Playlist: user added a song to his/her playlists
  • Add Friend: user hit the Add Friend social button
  • Roll Advert: user experienced an advertisement while listening
  • Login: user login action
  • Logout: user logout action
  • Downgrade: user viewed the Downgrade page
  • Help: user viewed the Help page
  • Settings: user viewed the Settings page
  • About: user viewed the About Us page
  • Upgrade: user viewed the Upgrade page
  • Save Settings: user adjusted settings
  • Error: user encountered an error
  • Submit Upgrade: user submitted an upgrade (nice :-) )
  • Submit Downgrade: user submitted a downgrade, a down-sell :-(
  • Cancel: user viewed the Cancel page
  • Cancellation Confirmation: user submitted and confirmed the cancellation (oh no!)
  • Register: user visited the registration page
  • Submit Registration: a new user registered

Those are all the possible activity events for users. Here is the bottom line: for a consumer service, we should focus primarily on engagement- and utilization-related events, things like song plays, likes, dislikes, and playlists. In the next section, we need to pay attention to these primary behavioral metrics.

User-experience-related page browsing events, such as ads heard, home page views, help page visits, settings adjustments, etc., might not be as important as the engagement/utilization metrics.

Measuring Churn

Measuring churn can be tricky and confusing if you don't stick to your specific business model and its nature. For churn measurement in this project, the log contains possibly related events such as Submit Downgrade and Cancellation Confirmation. But which one identifies churn? Should I count Downgrade as churn in this case? Here is the logic: this is a freemium business model. Both free- and paid-tier users can cancel, but only paid users can downgrade. I consider Downgrade a down-sell in this scenario: users who downgrade continue to use the service, so they are still retained. If the business measured churn with an MRR (Monthly Recurring Revenue) formula, then a Downgrade, being a down-sell that possibly reduces MRR, could reasonably be counted as churn. However, the dataset contains no pricing data, so I decided to measure churn on a standard count basis and ignore Downgrade. The indicator for churn is therefore 'Cancellation Confirmation': users who finally confirm cancellation are the users who churn.
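
A minimal sketch of that labeling rule, building on the trimmed frame above (variable names are mine, not necessarily those in the repository):

# A user is labeled as churn if they ever hit 'Cancellation Confirmation'
churn_ids = set(df.loc[df['page'] == 'Cancellation Confirmation', 'userId'])

users = pd.DataFrame({'userId': df['userId'].unique()})
users['is_churn'] = users['userId'].isin(churn_ids).astype(int)

print(users['is_churn'].mean())  # overall churn rate, roughly 0.23 on the mini set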

After some calculation for churn measurement:

  • Number of all distinct users: 225
  • Number of paid users: 165
  • Number of free users: 60
  • Number of Churn Users: 52
  • Number of Retain Users: 173
  • Churn rate within all users: 0.23
  • Churn rate within free-tier users: 0.35
  • Churn rate within paid-tier users: 0.19

Churn Rate: 23%. (Number of Total Churn Users / Number of All Users)

It also shows that paid-tier users have more ‘loyalty’ than free-tier users. (19% vs 35%)

Feature Engineering (Metric Design)

Now the fun part: Feature Engineering (or, in business terms, Metric Design). Feature Engineering is the more common term in the data science/data engineering field. The two are largely interchangeable: both are about generating analytic features/metrics for building a model. There are no strict rules; some believe it is a combination of art and science. Exciting? Let's get started!

Feature: Tenure Days

Tenure means the number of days (or length of time) a user has been using the service. For this dataset, the formula would be:

Tenure = latest event timestamp - user registration timestamp
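
A sketch of that computation, assuming both ts and registration are millisecond epoch timestamps:

# Tenure in days: time between registration and the user's latest event
per_user = df.groupby('userId').agg(last_ts=('ts', 'max'),
                                    registration=('registration', 'max'))
per_user['tenure'] = (per_user['last_ts'] - per_user['registration']) / (1000 * 60 * 60 * 24)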

Visualization check:

Churn users generally have fewer tenure days than retained (non-churn) users: 57 days vs. 86 days.

Hypothesis: churn users have fewer tenure days. Metric: tenure

Feature: UserAgent

A raw userAgent instance looks like "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.103 Safari/537.36". It gives us specific information such as browser, device OS, and hardware. I will generalize it to a device category to check whether the device might be a factor.

Apply a custom transformation to create a 'device' column.
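
My actual transformation lives in the repository; a simplified sketch with assumed device buckets could look like this:

# Bucket the raw userAgent string into a coarse device/OS category
def parse_device(user_agent):
    ua = str(user_agent)
    if 'Windows' in ua:
        return 'Windows'
    if 'iPhone' in ua or 'iPad' in ua or 'Macintosh' in ua:
        return 'Apple'
    if 'Linux' in ua:
        return 'Linux'
    return 'Other'

df['device'] = df['userAgent'].apply(parse_device)
print(df.groupby('device')['userId'].nunique())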

Also, compute the percentage of churn for each device group.

Stats screenshots

Visualization of device churn check

It seems that Linux users have a higher churn percentage. However, the number of Linux instances is limited, so at this point I am not sure whether this is due to random chance. It would become clearer with more Linux user data. For the mini set, I will include it in training anyway. (Note that there is a risk of over-fitting due to the few instances.)

Hypothesis: Users using the Linux version of the product are more likely to churn. Metric: is_linux

Feature: Location

The raw events contain users' location information as a city and state abbreviation code. These are too specific, so I will generalize them by mapping them into four major U.S. regions and comparing percentages.
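
A sketch of that mapping; the state-to-region table below is an illustrative, deliberately partial assumption:

# Map the trailing state code of 'location' (e.g. 'Boston-Cambridge-Newton, MA-NH')
# to a broad U.S. region; unknown states fall into 'Other'
REGION = {
    'MA': 'Northeast', 'NY': 'Northeast', 'NJ': 'Northeast', 'PA': 'Northeast',
    'IL': 'Midwest', 'OH': 'Midwest', 'MI': 'Midwest', 'MN': 'Midwest',
    'TX': 'South', 'FL': 'South', 'GA': 'South', 'NC': 'South',
    'CA': 'West', 'WA': 'West', 'OR': 'West', 'CO': 'West',
}

def to_region(location):
    state = str(location).split(',')[-1].strip().split('-')[0]
    return REGION.get(state, 'Other')

df['region'] = df['location'].apply(to_region)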

Stats results:

Visualization check

As the plot shows, the regions are roughly proportional. There is no strong evidence that region is a factor in churn, so I ignore location.

Feature: Gender

Gender is another categorical data available in the dataset. Is gender a matter to churn? Let’s find out:

Compute the percentage comparison.

Stats:

Visualization Check:

Another way of grouping visualization check:

The difference is noticeable.

Hypothesis: males are relatively more likely to churn than females. Metric: is_male

Feature: Level

Is there any difference in churn rate for free-tier users vs. paid-tier users?

Stats

35% vs. 18%

Visualization Check

There is a clear difference.

Hypothesis: free-tier users churn at a higher rate; paid-tier users have more 'loyalty' to the service. Metric: is_free

(Features: Aggregations of all kinds of ‘page’ events.)

Next, I will iterate through the different events, calculate aggregations, lay out the stats, and decide whether to establish a metric hypothesis where appropriate. (Since the aggregations are pretty straightforward, I will not visualize every single one, to save some space in this post.)
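
The aggregations below all share one pattern: count a user's events of a given page type and compare the churn vs. retained averages. A sketch, reusing the users frame from the churn-labeling step (the column names are assumptions mirroring the metric names used later):

# Per-user counts of selected page events, one row per user
events_of_interest = {'NextSong': 'n_play', 'Thumbs Up': 'n_like',
                      'Thumbs Down': 'n_dislike', 'Add to Playlist': 'n_addtolist',
                      'Add Friend': 'n_addfriend'}

agg = (df[df['page'].isin(events_of_interest)]
       .groupby(['userId', 'page']).size()
       .unstack(fill_value=0)
       .rename(columns=events_of_interest))

# Average of each aggregate for churn vs. retained users
print(agg.join(users.set_index('userId')['is_churn']).groupby('is_churn').mean())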

Feature: Song Play Count

  • Churn-Users: 699.88 (avg)
  • Retained-Users: 1108.17 (avg)

Retained users play significantly more songs on average than churn users.

Hypothesis: churn users play fewer songs. This is a primary metric. Metric: n_play

Feature: Like Count

  • Churn-Users: 35.75 (avg)
  • Retained-Users: 61.80 (avg)

Hypothesis: retained users hit 'Like' more often. Metric: n_like

Feature: Dislike Count

  • Churn-Users: 9.54 (avg)
  • Retained-Users: 11.85 (avg)

Not much difference, but we will test it out, as it may relate to engagement. Metric: n_dislike

Feature: Add to Playlist Count

  • Churn-Users: 19.96 (avg)
  • Retained-Users: 31.72 (avg)

Hypothesis: churn users add fewer songs to the playlist. Metric: n_addtolist

Feature: Add Friend Count

  • Churn-Users: 12.23 (avg)
  • Retained-Users: 21.04 (avg)

Hypothesis: churn users add fewer friends. Metric: n_addfriend

Feature: Home Page Viewing Count

  • Churn-Users: 32.15 (avg)
  • Retained-Users: 48.61 (avg)

There is also a noticeable difference: statistically speaking, churn users visit the homepage less, and retained users check the home page more frequently. However, does a homepage visit matter to engagement or service utilization? I am reluctant to use it and decided to ignore it. I will explain the decision further below.

Feature: Advertisement Listened Count

  • Churn-Users: 18.60 (avg)
  • Retained-Users: 17.14 (avg)

Not much difference. Ignore it.

Feature: Downgrade Page Viewing Count

  • Churn-Users: 6.48 (avg)
  • Retained-Users: 9.93 (avg)

It shows some difference, though really not that much, and it is not related to engagement. Ignore it. I will also explain this decision below.

Feature: Submit Downgrade Action Count

  • Churn-Users: 0.17 (avg)
  • Retained-Users: 0.31 (avg)

It is a noticeable difference, even double. But I will not select it, for reasons explained below.

Feature: Help Page Viewing Count

  • Churn-Users: 4.59 (avg)
  • Retained-Users: 7.02 (avg)

Not much difference at this scale. Ignore it.

Before continuing, let's pause for a minute to explain some decisions I made for the aggregated data.

For a consumer product/service, the bottom line for feature engineering (metric design) is: WE SHOULD FOCUS ON USER BEHAVIORAL EVENTS, specifically on:

  • Engagement, and
  • Utilization

Some people might argue that, statistically speaking, any stats showing a noticeable difference should also be considered.

Theoretically, yes. But I would like to emphasize that metrics should be explainable and meaningful for our specific problem and purpose. In this case, we focus our attention on metrics related to engagement and utilization. Including every statistically plausible metric might score higher in the model, but it risks over-fitting by feeding too many nuances for the machine learning training to memorize. Recall that our goal in building a predictive model is to extract and generalize the pattern as much as possible, not to overfit or memorize. For example, say number_of_downgrade_action shows a relationship to churn rate; we should then ask what behaviors cause a downgrade in the first place. Do Home page or Help page views really matter to song-playing engagement? Sometimes it is not a good idea to overthink. Our goal here is to generalize core product engagement and utilization. Let's move on.

Quickly wrap up some decisions so far:

  • Categorical data (nominal): is_linux, is_male, is_free
  • Discrete aggregation data: select those related to user engagement and service utilization that show some evidence, things like song plays, likes, dislikes, playlist adds, sessions, items in session, tenure days
  • Continuous data: for example, playback duration length

With some computation, I have calculated the following features (metrics) for each user.

  • n_play
  • n_session
  • n_item
  • n_addtolist
  • n_addfriend
  • n_like
  • n_dislike
  • n_distinct
  • tenure
  • total_length
  • ratio_length_per_session
  • ratio_addtolist_per_play
  • ratio_like_per_play
  • ratio_dislike_per_play
  • change_perc_n_play
  • change_perc_total_length

For details of the calculations and formulas, refer to my source code hosted here. The meaning of each metric should be self-explanatory from its variable name. However, I want to elaborate on ratio metrics and change metrics.

A ratio metric is designed from two or more standalone metrics. It often indicates a unit cost, unit usage, or unit engagement: Y per X. Interestingly, some standalone metrics do not show a relationship with churn on their own, but reveal a trend when embedded in a ratio metric. You will see detailed evidence of this next.

A quick look at a ratio metric's stats:

A change percentage metric measures momentum shifting. This kind of design appears in many different industries: the stock market measures the change of the P/E ratio, and in coffee roasting the RoR (Rate of Rise) metric measures dynamic temperature change. In this case, I generate a change percentage of songs played and a change percentage of playback duration across two recency periods. The design tries to reveal whether the momentum of change has any relationship with churn.
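
A sketch of both designs, a ratio metric and a change-percentage metric, building on the aggregates above (the 14-day window length is my assumption):

import numpy as np

# Ratio metric: dislikes per song play, a unit of disengagement
agg['ratio_dislike_per_play'] = agg['n_dislike'] / agg['n_play'].replace(0, np.nan)

# Change-percentage metric: song plays in the last 14 days vs. the 14 days before that
df['dt'] = pd.to_datetime(df['ts'], unit='ms')
recent_cut = df['dt'].max() - pd.Timedelta(days=14)
prior_cut = recent_cut - pd.Timedelta(days=14)

songs = df[df['page'] == 'NextSong']
recent = songs[songs['dt'] >= recent_cut].groupby('userId').size()
prior = songs[(songs['dt'] >= prior_cut) & (songs['dt'] < recent_cut)].groupby('userId').size()

change = ((recent - prior) / prior).rename('change_perc_n_play')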

A quick look at a change percentage metric's stats:

Create a flattened analytic dataset

Once the features are computed, let’s create an analytic dataset by flattening the metrics.

Here is a quick look at the analytic dataset I created:

Also, don't forget to concatenate the transformed categorical features:

And the outcome label column is 'is_churn'.

Then fill the missing data.
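
A sketch of these assembly steps, joining the pieces built in the earlier sketches (the flag definitions are my assumptions):

# One row per user: label, event aggregates, tenure, and change metric
analytic = (users.set_index('userId')
            .join(agg)
            .join(per_user[['tenure']])
            .join(change))

# Transformed categorical flags
flags = df.groupby('userId').agg(
    is_male=('gender', lambda s: int((s == 'M').any())),
    is_free=('level', lambda s: int(s.iloc[-1] == 'free')),
    is_linux=('device', lambda s: int((s == 'Linux').any())))
analytic = analytic.join(flags)

# Keep the outcome label as the last column and fill missing values
analytic = analytic[[c for c in analytic.columns if c != 'is_churn'] + ['is_churn']]
analytic = analytic.fillna(0)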

Metric Cohort Analysis

Next, let's conduct a metric cohort analysis. For each feature (metric), divide the users into ten cohorts based on the metric's values, and compare each cohort's mean metric value against that cohort's churn rate. Then plot them in a line chart to analyze the relationship with churn. We will also check the skewness stats.
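
A sketch of one cohorting step (the plotting helper credited later in this post is richer; this only reproduces the grouping logic under my assumptions):

# Split one metric into up to 10 equal-size cohorts and compare churn per cohort
def metric_cohorts(data, metric, n_cohorts=10):
    cohort = pd.qcut(data[metric], q=n_cohorts, duplicates='drop')
    return data.groupby(cohort).agg(metric_mean=(metric, 'mean'),
                                    churn_rate=('is_churn', 'mean'))

print(metric_cohorts(analytic, 'n_play'))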

The first attempt at metric cohort analysis. Results:

Our first run of metric cohort analysis isn't too bad, but it is certainly noisy. Don't worry, we will work on refinement next. Some metrics do show a relationship with churn, while some are volatile. Let's take away the highly skewed metrics.

Skewness Check

Bottom line: highly skewed metrics can be unreliable or problematic.

A skewness around zero indicates the metric is distributed roughly evenly around the mean, which yields more even cohort groups. If the skewness is too high, the distribution has a long tail; with a right-skewed (long right tail) metric, for example, most metric values fall below the mean.
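
A sketch of the skewness screening (the cutoff of 2 is my assumption):

# Compute skewness per metric and set aside the highly skewed ones
skew = analytic.drop(columns=['is_churn']).skew().sort_values(ascending=False)
print(skew)

too_skewed = skew[skew.abs() > 2].index
analytic = analytic.drop(columns=too_skewed)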

Data Normalization

The strategy here is to remove skewed metrics and normalize the remaining metrics data values. Then, conduct another round of cohort analysis to see how it goes.

I use MinMaxScaler() to transform the analytic data to scaled values.
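
A sketch of that scaling step (the column layout mirrors the later slicing, with userId first and is_churn last):

from sklearn.preprocessing import MinMaxScaler

# Scale every feature column to [0, 1]; leave userId and the label untouched
df_normalized = analytic.reset_index()
feature_cols = [c for c in df_normalized.columns if c not in ('userId', 'is_churn')]
df_normalized[feature_cols] = MinMaxScaler().fit_transform(df_normalized[feature_cols])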

A quick look here:

Refinement

With the reselected and normalized data, run another Metric Cohort Analysis. This time I also zoom out a bit, to 6 cohorts. Here is the result (2nd round):

Visually, that is so much improvement, isn't it? Most of the metrics show evident relationships with churn.

However, n_dislike and ratio_like_per_play are still not telling a story, so I will filter these two metrics out.

Be patient: time to run the last round of Metric Cohort Analysis to confirm. Let's see the result:

Awesome! After all the effort on Feature Engineering with rounds of refinement, the re-selected metrics all show some degree of relationship or pattern here.

Before we move to the next section, recall the reminder above about ratio metric design. Have you noticed the interesting finding?

Check the plot of n_dislike vs. ratio_dislike_per_play

n_dislike
ratio_dislike_per_play

That's right. The standalone count of dislikes per user, n_dislike, does not show any clear pattern. However, when we design it into a ratio representing a unit of disengagement, ratio_dislike_per_play, the higher it is, the more likely the user is to churn.

Machine Learning

We have done bits of data science and data engineering work so far. Now, let's move on to the machine learning part of the project. This churn problem calls for a supervised classification model labeling a binary outcome: a YES/NO (1/0) prediction. For this project, I will try Logistic Regression and XGBoost (tree-based) models and compare their performance.

If we have designed the dataset adequately, it should be easy for the 'machine' to crunch the analytic data. So I will quickly lay out the code and then explain validation, evaluation, and justification in detail.

Logistic Regression

Import libraries

from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

First, subset the independent variable values (the feature set, X) and the dependent variable (the outcome set, y).

# Prepare independent variable (X) and dependent variable (y)

# Independent variable values, the X (the dataset without the userId and outcome label columns)
X = df_normalized.iloc[:, 1:-1].values

# Dependent variable. the y
y = df_normalized['is_churn'].values

Create the training set and the holdout test set, then train the model.

# Create train set, test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Initialize a logistic regression instance
log_reg = linear_model.LogisticRegression(penalty='l1', solver='liblinear', fit_intercept=True)

# Train the model
log_reg.fit(X_train, y_train)

Check the results on the train set and the test set.

# Do not use the accuracy score, because the labels are mostly '0' (imbalanced).
# Use the f1 score instead.

#f1 score on train set
print(f"f1 score on train set: {f1_score(y_train, log_reg.predict(X_train), average='weighted')}")

#f1 score on test set
print(f"f1 score on test set: {f1_score(y_test, log_reg.predict(X_test), average='weighted')}")
  • f1 score on train set: 0.6994369057509463
  • f1 score on test set: 0.7295742232451093

XGBoost

“XGBoost initially started as a research project by Tianqi Chen as part of the Distributed (Deep) Machine Learning Community (DMLC) group. It became well known in the ML competition circles after its use in the winning solution of the Higgs Machine Learning Challenge.” (Wikipedia)

Import the library and train, reusing the same train/test split.

import xgboost

# XGBoost classifier model instance
xgb_clf = xgboost.XGBRFClassifier(use_label_encoder=False)

# Train the model.
xgb_clf.fit(X_train, y_train)
# Train set f1 score.
f1_score(y_train, xgb_clf.predict(X_train), average='weighted')

# Test set f1 score.
f1_score(y_test, xgb_clf.predict(X_test), average='weighted')
  • f1 score on train set: 0.8949914949914949
  • f1 score on test set: 0.8222222222222222

XGBoost outperforms Logistic Regression in this case. Let's further conduct hyperparameter tuning to see if we can improve the model even more, using GridSearchCV.

# GridSearch Cross Validation Tuning the model

param_grid = {
"max_depth": [3, 4, 5, 7],
"learning_rate": [0.1, 0.01, 0.05],
"gamma": [0, 0.25, 1],
"reg_lambda": [0, 1, 10],
"scale_pos_weight": [1, 3, 5],
"subsample": [0.8],
"colsample_bytree": [0.5],
}

# XGBClassifier instance
xgb_cl = xgboost.XGBClassifier(objective="binary:logistic", use_label_encoder=False)

# GridSearchCV instance
grid_cv = GridSearchCV(xgb_cl, param_grid, n_jobs=-1, cv=3, scoring="f1_weighted")

# Train
grid_cv.fit(X_train, y_train)
# f1 score on train set
f1_score(y_train, grid_cv.predict(X_train), average='weighted')

# f1 score on test set
f1_score(y_test, grid_cv.predict(X_test), average='weighted')
  • f1 score on train set: 0.8637037037037038
  • f1 score on test set: 0.8418679549114332

Validation, Evaluation, Justification

As the results show, the winning model is XGBoost, with a test-set f1 score of 84.2% (after hyperparameter tuning). That is about 11 percentage points higher than Logistic Regression in this case.

Keep in mind that this sort of churn dataset is highly imbalanced. The majority of the observations are NotChurn, indicating retained users. In other words, there are many more observations with the '0' label than the '1' label.

(Note that for the train/test split process, the 'stratify=y' argument is important here. It ensures the split maintains the class proportions of the original dataset.)

What has been fed into the Logistic Regression training is biased. The logistic regression here might be over-optimistic because, having learned from imbalanced data, it favors predicting '0' as the outcome. That is not helpful for churn problems; we miss out on opportunities to detect churn.

Some mention that the word 'regression' in Logistic Regression is confusing, because it is actually a classifier that estimates the probability of an outcome. It makes sense that it 'lost' to XGBoost in this case: the labeling is not about estimating a continuous value, and it is much trickier to fit a single set of coefficients for such a churn problem.

In contrast, XGBoost is based on ensembles of decision trees (boosting with random-forest-style training). It creates many different trees, applies clever penalization, and corrects the errors of previous trees. It works excellently for churn classification problems.

The primary concern in predicting/forecasting churn is the Type II error: the false negative. Here is why, with a fire alarm analogy: a fire alarm that does not ring when a fire is happening is life-threatening. For a churn problem, a false negative means failing to identify churn when it happens, so we miss the best timing or opportunity to retain that user. The other way around, if a model accidentally identifies a non-churn user as churn and the marketing team sends them a special offer, e.g., free extra usage, it simply makes an already happy customer happier. From a business perspective, it may increase cost in the short term, but it is not as harmful as missing real churn.

Generally, in some fields it can be better to withhold a detection than to raise a false one (a Type I error), as in the medical industry. But not always: for consumer churn problems, Type II errors are worse in this case, because once we fail to detect churn and users leave, they DO go. They hardly ever turn back.

The aim of the model is not the ability to successfully label '0'. For example, if 10 out of 100 users would churn, a model predicting all '0's still scores 90% under the default accuracy score, because 90 out of 100 are '0's. Therefore the default accuracy score does not make sense: in that case the model missed all ten of ten churns and still got a 90+ score.

This is why I picked the F1 score as the evaluation metric. What we care about more is how many real churns the model is able to identify. It takes the harmonic mean of precision and recall rather than blindly using default accuracy.

(For more details of formula, check Precision and Recall, F-score. )

To run validation based on the f1 score, simply pass the holdout test features to a model's predict method and check the returned predictions against the holdout test labels using the f1_score function.
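
As a minimal sketch, here is that check together with a confusion-matrix breakdown counting how many real churners each fitted model catches:

from sklearn.metrics import confusion_matrix

for name, model in [('Logistic Regression', log_reg), ('XGBoost (tuned)', grid_cv)]:
    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
    print(f"{name}: caught {tp} of {tp + fn} actual churners "
          f"(false negatives: {fn}, false positives: {fp})")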

Besides, here is a concrete example of why I am disappointed with Logistic Regression, comparing the predictions the two models just made:

Logistic Regression predictions on the test set: almost all '0's for the actual churn instances.

XGBoost predictions on the test set: much better performance, with most actual churn instances correctly predicted.

Number of actual churn instances is successfully predicted: XGBoost

Let's put the performance of the models into a table for comparison:

As you can see, XGBoost scores 11 points more than Logistic Regression, but note that the difference in the number of real churns correctly predicted is vast: 7 vs. 1! That is quite impressive for XGBoost on churn classification.

In contrast, even though Logistic Regression scored 0.73, which looks acceptable, its actual predictions are inferior.

For the XGBoost grid search cross-validation training, scoring="f1_weighted" is specified, telling the training to maximize this score.

Reflection & Improvement

This was a challenging and highly practical project to work on. I particularly enjoyed the Feature Engineering (Metric Design) part. However, no customer details or subscription details are provided, only a raw event log. Therefore, I needed to invest more time and experimentation in various metrics to develop a churn strategy and iterate on refinement. I would say probably almost 90% of my time on this project went into Feature Engineering, about 5% into data cleaning, wrangling, preprocessing, and transforming, and the remaining 5% into the machine learning models.

Further improvements can be made, including grouping correlated features, also known as dimension reduction. Before that, calculate a feature correlation matrix to decide which metrics to filter out or group together. It would further help the model generalize.
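
A sketch of that correlation check, using the normalized frame from the earlier sketch (the 0.8 threshold is my assumption):

import numpy as np

# Pairwise feature correlations; strongly correlated pairs are candidates to merge or drop
corr = df_normalized[feature_cols].corr()
high_corr = (corr.abs()
             .where(~np.eye(len(corr), dtype=bool))  # ignore the diagonal
             .stack()
             .loc[lambda s: s > 0.8])
print(high_corr.sort_values(ascending=False))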

For the full-size 12 GB dataset, it is better to run with Apache Spark. Amazon AWS EMR is one option for setting it up; however, I highly recommend using Databricks. I found it much easier and quicker to set up a cluster, and it provides a cleaner interface. (Databricks is a higher-level application built on Apache Spark.) It also auto-scales the number of worker nodes up and down for me. The methodology and general flow for the data science and machine learning implementation are the same as for the mini set we worked through above; we just need another set of data engineering tools on Databricks. It provides pandas-DataFrame-equivalent capabilities through PySpark, and SparkSQL to run SQL if you prefer the SQL way. For example:
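
As an illustration of what that looks like on Databricks (the path below is an assumption), the same churn labeling could be expressed with PySpark or SparkSQL:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the full event log (path is an assumption; on Databricks it may live in DBFS or S3)
events = spark.read.json("/mnt/sparkify/sparkify_event_data.json")

# PySpark version of the churn indicator
churners = (events.filter(F.col("page") == "Cancellation Confirmation")
                  .select("userId").distinct())
print(churners.count())

# Or register a temp view and use SparkSQL
events.createOrReplaceTempView("events")
spark.sql("""
    SELECT COUNT(DISTINCT userId) AS n_churn
    FROM events
    WHERE page = 'Cancellation Confirmation'
""").show()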

And recently, the Databricks team has been porting Koalas into the Databricks runtime, which means we can also use pandas-like syntax on Spark.

I recommend that anyone who needs Apache Spark try Databricks. It charges on a usage basis. (I have no affiliation with it.)

Reference & Credit

The snippets for the metric cohort plot and the skewness reshape function are adapted from Carl S. Gold, the author of 'Fighting Churn with Data'. I would also like to recommend his book: I enjoyed reading it, it is very knowledgeable, and it provides systematic techniques for working with all kinds of subscription-based businesses. It is a great book that also guided my approach to this project.

And thanks to Udacity for providing this raw event log dataset.

Thanks for reading!
