Starbucks User Base Clustering Simulation, Data Scientist Nanodegree Capstone Project
In B2C marketing, two concepts are constantly top-of-mind: desired outcome rates, and wasted coverage.
On one hand, we have a marcom program, and we want to get it in front of everyone we can convert as quickly as possible.
On the other, the number of convertible people out there is minuscule compared to the number of people we could feasibly reach with our program. If we go nuts with the targeting, we waste money on ad costs that go nowhere. We also waste the patience of those who might be interested in us later but just aren’t ready to hear from us now.
How to maximize those delicious outcome rates while minimizing the dreaded wasted coverage?
As data scientists, we answer that question seemingly every day, if we work in the marketing space.
One very time-efficient way to answer that question is user base segmentation. We can take, say, some meaningful demographic fields from our CRM, cluster our user base into a few segments based on those dimensions, and then contrast outcome rates of various marcom tactics among those segments.
Imagine something like this:
That’s the approach I took to this project. I was given some simulated user data from Starbucks. With it, I grouped their users into 5 segments and contrasted the outcome rates of two different sales promotion types.
Ultimately, I was able to figure out how Starbucks can tell ahead of time whether a particular user is likely to respond to one, both, or neither of its sales promo techniques.
Author Note: I’m about to finish Udacity’s 5-month Data Scientist Nanodegree program. This blog post is a writeup of the procedure and outcomes of my final data science project in that program.
Project Definition
The project directive was to determine how Starbucks should target each of three promotions to a simulated user base:
- Discounts on in-store purchases (“Discount”)
- Get a free item after spending some amount of money in-store (“BOGO”)
- Purely informational promos with no call to action (“Informational”)
The basis of these targeting recommendations was to be the contents of a simulated transaction data set and four user demographic variables:
- Age
- Gender
- Income
- Account Creation Date
So, to Formalize the Business Question I Sought to Answer:
“How should Starbucks target its three promotional offer types in order to maximize their completion rates and minimize wasted coverage?”
Research Questions
- How likely are Starbucks users to respond to promotions of each type the first time they receive the promotion? (More on that qualification later)
- How can Starbucks’s user base be segmented such that promotions can be targeted to the segment(s) most likely to respond to them?
Answering these two questions analytically would get me to the root of my chosen business question.
Data Exploration
The Users Data Set
17,000 users are represented in Starbucks’s Users data set. Most of the 4 demographic variables are distributed relatively normally around a mean (if continuous) or relatively evenly across categories (if categorical).
The exception is the variable “became_member_on.” This variable represents the date on which the given user first created an account with Starbucks’s app. That variable skews heavily in favor of recent dates.
Age varies between 18 years and 101 years. Its mean and median (approx. 55) indicate that Starbucks’s user base skews older (see the histogram below for corroboration).
Income varies between $30,000 and $120,000 around a mean/median of $65,000.
Temporarily leaving became_member_on as a number lets me cheat some descriptive statistics out of it as well. This only works because the dates are formatted YYYYMMDD, so chronological order and numerical order are the same thing.
Our earliest date is July 29, 2013, and our most recent date is July 26, 2018. But the median is significantly higher than the mean, suggesting our outliers lie at the low end of the date range.
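For the curious, here’s a minimal sketch of that cheat. It assumes the Users data set is already loaded into a pandas DataFrame called users; the frame name is my assumption, the column name comes from the data.

```python
import pandas as pd

# Because YYYYMMDD numbers sort chronologically, order statistics on the
# raw column are already meaningful:
print(users['became_member_on'].describe())

# For human-readable dates, parse the numbers explicitly:
member_dates = pd.to_datetime(users['became_member_on'].astype(int),
                              format='%Y%m%d')
print(member_dates.min(), member_dates.max())
```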
Take a look at the distributions for yourself:
Gender, age and income each have their own way of flagging blank entries in user records. A blank income was NaN until I converted it to 0. An age of 118 means the user did not provide an age. And missing gender values were None until I converted them to “Unknown.”
Any user with a null in one of those fields has nulls in all of gender, age and income. Those 2,175 users are genuinely a segment of their own, representing 13% of the Users data set. This is a segment I will cover in more detail in the cleaning section, below.
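A rough sketch of those conversions and the all-or-nothing null check, using the same assumed users frame as above:

```python
# Standardize the three 'missing' markers described above.
users['gender'] = users['gender'].fillna('Unknown')  # None -> 'Unknown'
users['income'] = users['income'].fillna(0)          # NaN  -> 0
# (age already uses 118 as its 'not provided' sentinel)

# Confirm the nulls travel together: every user missing one field
# is missing all three.
no_data = (users['gender'].eq('Unknown')
           & users['income'].eq(0)
           & users['age'].eq(118))
print(no_data.sum(), no_data.mean())  # expect 2,175 users, roughly 13%
```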
The Transactions Data Set
The Transactions data set initially contained 4 fields describing a whopping 306,534 recorded events:
- User ID (“Person”)
- Event: offer received, offer viewed, offer completed or (monetary) transaction
- Value: Dictionaries that contain offer id for offer events and transaction amounts for transaction events
- Time: Time in hours since the beginning of the promotion period that the recorded event took place (0 represents the launch hour of the promotional period)
As you can imagine, I really had to transform this one in order to make it useful (more on that in the cleaning section, below). Here’s what Transactions looked like in its birthday suit (first 5 events):
After cleaning, I was able to learn a lot more about this data set.
The average user received 4 promotions, viewed 3 promotions and completed 2 promotions.
Worth noting are some possibilities that affected the cleaning stage of this project in interesting ways:
- A user could theoretically receive a promotion more than once
- A user could complete an offer before receiving it
- A user could complete an offer after the offer’s duration had expired
These complications were all mitigated by my business question. Namely, the part about solving for the offer most likely to influence a user the first time it’s received.
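To make the “first time it’s received” framing concrete, here’s a minimal sketch of keeping only each user’s earliest receipt of a given offer. It assumes the offer_id extraction described in the cleaning section below has already run, and the frame name transactions is my assumption:

```python
# Keep only the earliest 'offer received' event per user-offer pair.
received = transactions[transactions['event'] == 'offer received']
first_receipts = (received.sort_values('time')
                          .drop_duplicates(subset=['person', 'offer_id'],
                                           keep='first'))
```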
For good measure, I can report here that the transactions data set contained no null values. Lucky day, indeed, in data science land:
The Offers Data Set
Offers is a simple data set. No null values, only 10 rows. It’s simply a dimension table describing offer IDs. Ultimately, I used it in merges to get “offer_type” into the transactions data set.
Here is the Offers data set, in its entirety:
Data Cleaning Decisions
The Users Data Set
My first decision with the Users set was to keep all NaN entries and consider them a separate segment. This is because every user with one NaN demographic variable had NaN in all of their demographic variables. They are evidently a segment of users who have not yet provided Starbucks with any of their personal data.
Second, I dropped became_member_on from the data set entirely. Due to its heavy skew toward recent members, that variable was unlikely to have an insightful effect on the clustering analysis to follow. Most clusters would end up with a majority of new users.
Third, I cut age and income into segments to transform them into categorical variables. This allowed me to use judgment to encode meaning into those variables that a clustering algorithm would not recognize.
I could divide age into meaningful life phases and income into meaningful career levels without worrying about an algorithm making meaningless sense of their numeric versions.
Finally (and this is the reason I cut age and income into categorical variables), I one-hot encoded the whole data set. Each of my chosen demographic variables was turned into n new binary variables, where n is the number of levels in the original categorical variable.
This final step prepared the user data set for clustering.
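As an illustration, the binning and encoding steps might look like the sketch below. The bin edges are assumptions informed by the income bands that surface in the results, not necessarily my notebook’s exact cut points:

```python
import pandas as pd

# Cut the continuous variables into judgment-based categories.
users['age_group'] = pd.cut(users['age'],
                            bins=[17, 34, 54, 102],
                            labels=['young_adult', 'middle_aged', 'older_adult'])
users['income_level'] = pd.cut(users['income'],
                               bins=[0, 50_000, 80_000, 120_000],
                               labels=['low', 'mid', 'high'])
# Note: the sentinel values (age 118, income 0) fall outside every bin,
# so the Unknowns one-hot encode to all zeros on these columns.

# One-hot encode: each variable with n levels becomes n binary columns.
features = pd.get_dummies(users[['gender', 'age_group', 'income_level']])
```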
The Transactions Data Set
Up front, the transactions data set required that I parse the “value” field and extract from it an offer id. That allowed me to pair each event to the specific offer that was its subject.
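A minimal sketch of that extraction; the dict key names here are how the values appear in my copy of the data, so treat them as assumptions:

```python
def extract_offer_id(value):
    # Offer events carry an id under 'offer id' or 'offer_id';
    # transaction events carry only an 'amount', so they yield None.
    return value.get('offer id', value.get('offer_id'))

transactions['offer_id'] = transactions['value'].apply(extract_offer_id)
```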
From there, I filtered the set to remove rows describing “transaction” events. Money spent with Starbucks is irrelevant to my business question. The problem I’m helping Starbucks solve is driving app users to the store, not increasing dollars spent per store visit. Therefore, it was simplest to leave transactions out of the analysis altogether.
The next steps were to pivot out the “event” variable for “time” (time elapsed between the first day in the date range and the event’s date), merge in offer type from the offers data set and calculate a few new fields that would lead me to completion rates.
Above, “time” is converted from hours to days and appears as the values in the pivoted event columns. This allowed me to contrast it with “duration,” the length of time the row’s offer was valid after receipt.
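A rough sketch of that pivot-and-merge sequence, with the offers frame and its column names assumed:

```python
transactions['days'] = transactions['time'] / 24  # hours -> days

# Pivot the event types out into columns, keeping each event's earliest
# occurrence per user-offer pair.
offer_events = transactions[transactions['event'] != 'transaction']
pivoted = offer_events.pivot_table(index=['person', 'offer_id'],
                                   columns='event',
                                   values='days',
                                   aggfunc='min').reset_index()

# Merge in offer_type and duration from the Offers dimension table.
pivoted = pivoted.merge(offers[['id', 'offer_type', 'duration']],
                        left_on='offer_id', right_on='id', how='left')
```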
From this data set, I calculated a binary variable, offer_response, representing whether the user responded to the offer. It had two valid values:
- 0: If completion occurred before receipt, after the duration expired, or not at all
- 1: If completion occurred after receipt and within “duration” days of receipt
The binary offer_response variable allowed me to calculate response rates for each offer type and then easily test differences for significance.
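Sketched in code, the flag might be computed like this, building on the assumed pivoted frame above:

```python
# Response = completed after receipt AND within `duration` days of receipt.
completed_in_time = (
    (pivoted['offer completed'] >= pivoted['offer received'])
    & ((pivoted['offer completed'] - pivoted['offer received'])
       <= pivoted['duration'])
)
pivoted['offer_response'] = completed_in_time.astype(int)
# Offers never completed have NaN in 'offer completed'; both comparisons
# evaluate False there, so those rows correctly land at 0.
```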
Analysis Methodology
Analysis progressed in two steps:
- Use K Means to cluster the user data set into 5 user segments
- Break down mean offer_response by cluster and offer type
Part 1: Use K Means to Cluster User Data into 5 Segments
The combination of available demographic variables and the number of levels in each made an unsupervised clustering technique ideal. I could rely on mathematical distances among users to determine what “similar” meant in this data set.
For that reason, K Means is the algorithm I chose to cluster the user data set.
Applying the “elbow technique,” I determined that 5 clusters would be the value of k I would use to train the K Means model. Specifically, I ran K Means once for each value of k from 1 to 20. For each value of k, I calculated the within-cluster sum of squares (WCSS) and plotted the results as a curve.
For this project, the ideal cluster count is the value of k that sits closest to the elbow of the WCSS curve while still yielding a human-manageable number of user segments.
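A minimal sketch of that elbow search, using scikit-learn’s KMeans on the features frame from the encoding sketch above (the random_state is an assumption, included for reproducibility):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []
ks = range(1, 21)
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(features)
    wcss.append(model.inertia_)  # within-cluster sum of squares

plt.plot(ks, wcss, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('WCSS')
plt.show()
```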
Part 2: Break Down Mean Offer_Response by Cluster and Offer Type
The final step of my analysis was a simple one. I just had to merge cluster assignments with my cleaned transaction table, aggregate offer_response by calculating its mean at each cluster-offer_type intersection, and then finally pivot out the offer_type variable.
The resulting table was a simple contrast of response rates by cluster and offer type.
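A sketch of that final aggregation, again with names assumed (including the users frame’s id column):

```python
from sklearn.cluster import KMeans

# Assign each user to one of the 5 clusters.
users['cluster'] = KMeans(n_clusters=5, n_init=10,
                          random_state=42).fit_predict(features)

# Map cluster labels onto the cleaned transaction table, then pivot.
cluster_lookup = users.set_index('id')['cluster']  # user id -> cluster
pivoted['cluster'] = pivoted['person'].map(cluster_lookup)

rates = pivoted.pivot_table(index='cluster', columns='offer_type',
                            values='offer_response', aggfunc='mean')
print(rates)  # mean response rate at each cluster x offer_type intersection
```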
Extra: A Note on Cluster Model Improvements
Looking at the WCSS chart above, I originally thought I should train for 6 clusters; 6 was closer to what looked like the chart’s “elbow.”
However, when I trained the model with k=6, the resulting cluster breakdowns were unnecessarily difficult to understand. I worried about confusing business stakeholders with long-winded descriptions of arguably over-nuanced user segments.
Instead of the 5 easy-to-remember clusters I ultimately generated, I found these when I targeted k=6:
- The Unknowns (of course)
- Female, Over $80k per year, all ages
- Male, $50–80k per year, all ages
- Female, under $80k per year, all ages
- Male, under $50k per year, all ages
- Male, over $80k per year, all ages
The segments aligned neatly with gender and captured all ages, but the income subgroups are overly complicated. We can no longer think simply of low earners, mid earners and high earners; instead, we have to qualify nuanced combinations of those.
Even worse, 6 clusters yield no targeting insights that 5 clusters don’t. Have a look for yourself at the completion rates of the 6-cluster version of the user schema:
Most clusters still prefer discounts to BOGO. Two clusters still respond reasonably well to both. The Unknowns still don’t respond to anything.
By reducing the cluster count to 5, I produced clusters that are far easier for a decision maker to understand and remember. This eliminated a problem we could call “oversegmentation.”
Results
Clusters and their Natures
The clustering analysis yielded 5 intuitively meaningful segments among which to contrast outcome rates.
First, some interesting summary insights:
- Income was the most important determinant of cluster membership
- Gender was second-most-important
- Age was not important
- I ended up with 1 high-income segment, 2 mid-income segments and a low-income segment
You can see these insights brought to bear in the following matrix of frequency distributions. Columns are segments. Rows are demographic variables.
I interpreted the five clusters identified by my application of K Means in the following ways:
- High-Earners: People who make more than $80k per year, regardless of age or gender (although the data set favored people over 40 for this income group)
- Low-Earners: People who make under $50k per year, regardless of age or gender
- Mid-Earners, Male: Men who earn $50–80k per year
- Mid-Earners, Female: Women who earn $50–80k per year
- The Unknowns: People who have not provided Starbucks with any personal data, yet.
Response Rates by Cluster and Offer Type
The Unknowns were the standout user segment of the bunch. They responded at very low rates to both discounts (23%) and BOGOs (8%).
Mid-Earning Women and High Earners were somewhat alike in that they responded at high rates to both discounts (75/77%) and BOGOs (68/70%).
Low Earners and Mid-Earning Men were somewhat alike in that they responded at high rates only to discounts (63/68%). By contrast, targeting them with BOGO (35/45%) resulted in greater than 50% wasted coverage.
Testing Response Rates for Significance
To test each offer_type comparison for each cluster, I calculated 95% confidence intervals assuming a binomial distribution.
Every difference passed the test for significance (both practical and statistical). That is, no cluster saw any overlap between the 95% intervals for discount completion rate and BOGO completion rate.
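For reference, a normal-approximation 95% confidence interval for a binomial proportion can be sketched like this; the counts below are illustrative, not the project’s actual numbers:

```python
import numpy as np

def binomial_ci(successes, n, z=1.96):
    # Normal approximation to the binomial: p +/- z * sqrt(p(1-p)/n)
    p = successes / n
    half_width = z * np.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# e.g., 750 completions out of 1,000 discount offers received:
print(binomial_ci(750, 1_000))
```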
To avoid flooding this blog post with confidence interval plots, I’m making the whole set available in the Jupyter Notebook in my Github repo. I’ll include one below to demonstrate the concept, however.
Implications
To maximize overall completion rates, Starbucks can target every segment but The Unknowns with the discount promotion type. There is reason to experiment further with BOGO among the high-earner cluster and the mid-earner female cluster; BOGO clearly appeals to them.
Can we further divide those segments such that one is an optimal BOGO segment while the other is an optimal discount segment?
To minimize wasted coverage, Starbucks should consider a different approach altogether to reach The Unknowns. It may be that The Unknowns are a group of users who are not very invested in Starbucks or its products yet. But there’s evidently some interest in that segment; otherwise, they would not be users at all.
Focusing the informational promotion on them may be useful, in particular if it nudges them toward providing their demographic data once their involvement has reached a certain level.
Some Interesting Bits I Learned from this Exercise
The data I used were simulated, and making actual recommendations to Starbucks was never the primary objective of this project. Primarily, I wanted to explore the dataset and K Means as an unsupervised clustering technique.
K Means is really quick to implement. Further, producing visual insights from the clusters rather than training another predictive model can be an extremely efficient practice in our oft-time-constrained field.
Here are the insights that expanded my mind the most as I completed this project:
- Oversegmentation: Even close to the WCSS “elbow,” training for one cluster too many can make the resulting user segments more difficult to understand than necessary. An entirely automated system wouldn’t care how many segments we ended up with. In cases like this one, though, where a human is making decisions based on the data, digestibility is a strong asset.
- Null Values as a User Segment: Finally! I’ve heard tell of the mythical “null” user segment one can find in some data sets, but this is the first time I’ve encountered one worth keeping. And it ended up being critical: Starbucks can avoid significant wasted coverage by focusing comms on deepening the relationship with that 13% user segment rather than diving straight into the sales promotions with them.
- Confidence Intervals Rather than P Values: Assessing statistical significance with minimum Type I error rates is well understood in the world of data science. But it has its drawbacks. For one, there’s no strong comeback to “why must alpha be 0.05?” For another, it’s extremely rich in concept and can therefore be kind of out-of-reach for a business stakeholder making a decision. Plotting confidence intervals can bypass all of that in cases like this one, where there’s no overlap and large gaps between the two intervals being tested. Presentability without ambiguity is a golden combination.
Even More Detail!
Visit my Github repo to explore the datasets and my research notebook for yourself. Let me know in the comments if anything new occurs to you in the process!