Ecommerce Analytics Playbook – Cohort Analysis

This article describe the steps on how to run a cohort analysis in python. However is also part of something more extended that I always wanted to work on. In the past few weeks I’ve written a series of articles on practical use of Python for specific eCommerce use case. Consider it as a basic toolkit of concepts and methodology for analysts to quickly answer the most relevant questions around an eCommerce business, from overall performance to active/repeat buyers, profitability, lifetime value or basket composition. These should be the essential for any transactional business on the internet.

Please note this is still a work-in-progress. I’m consolidating years of notes and smaller projects, so the list below is likely to change:

  1. Cohort Analysis
  2. eCommerce Exploratory Data Analysis
  3. Association Analysis
  4. Customer Segmentation
  5. RFM Analysis
  6. Customer Lifetime Value
  7. TBD

…ok, but WHY??

Most machine learning and data analytics courses provide a solid foundation on methodology and overall statistical knowledge. This is clearly necessarily and indeed it should be the core of any course, however most of them show significant weaknesses in two main areas:
  1. Examples are usually extremely simplified versions of real problems. While this is great for beginners, the reality is often way less forgiving. The average data available in most companies is often incomplete, requires significant cleaning and feature engineering. The transition from class materials to the real world can be quite unforgiving. The closer to a real world scenario we train and learn, the less intimidated we’re going to be.
  2. second and to me the most relevant: there is very little information available for commerce analytics specifically. Most courses provide general knowledge on methods and techniques, but hardly specific use cases for commerce or marketplaces. And most of these courses focus on quite some aspirational goals totally overlooking the basics (e.g. active/repeat buyers, CLV, basked analysis…). Most students would finalize those courses with a pretty decent understanding of statistics but could hardly relate which tool is best for which problem.
The goal here is to provide the reader with enough knowledge to solve real day to day problems in any commerce or transactional environment using Python and machine learning libraries, clearly relating each use case with the right tool and eventually having a positive impact to the bottom line of the company.

…so who is this for??

I’d rather say for who is this NOT:

  • If you’re an accomplished data scientist on the 0.5% of Kaggle, you’d be better closing this tab and switch over to reddit. I’m afraid there’rs nothing
  • If you’re not a data scientist but know enough Python to build and deploy your own models, you’d be also better getting on reddit now


  • If you’re a marketer or a business lead with a pretty decent technical background and some coding, you could enjoy this
  • If you’re a business analyst with strong SQL skills and decent business acumen, this could be a decent intro (and mostly immediately actionable) to Python and data science.

…fine, I’m neither one of the last two categories but I want to do this. What do I need?

  • Some familiarity with Python. Don’t expect to understand each and every line of code but at least have a grasp of the language
  • Python3.6 installed on your machine (look into this guide for OSX and this other one for windows if you dont have any). Also Jupiter Notebooks is a great tool to build and test stuff locally.
  • Some business understanding (e.g. you must know how an eCom business work, but that should go without saying).

OK, let’s start.

What is Cohort Analysis

Straight up from wikipedia:

Cohort analysis is a subset of behavioral analytics that takes the data from a given data set (e.g. an EMRS, an e-commerce platform, web application, or online game) and rather than looking at all users as one unit, it breaks them into related groups for analysis. These related groups, or cohorts, usually share common characteristics or experiences within a defined time-span.

What this means is that we’ll be slicing our dataset in different bits with the same characteristics and we’ll try to understand how differently these groups behave through time.

Always keep in mind that cohorts are mutually exclusive segments that compare metrics across product lifecycle but also compare metrics across customer lifecycle. Essentially you can have time cohorts, behavior cohorts and size cohorts. For this example we will be looking at behaviour through the customer lifecycle, measuring time between first and following purchases.

Why (eventually) every business needs to do Cohort Analysis

Current eCommerce often struggle to switch from a transactional/sales-oriented to a customer-based perspective. A last-click-sales-oriented perspective of the business is one that primates gross revenue, average order value and sales and often neglects active buyers, repeat buyers, new or reactivated buyers. These metrics are all complementary to each other.
This mindset has a lot to do with the level of maturity of an analytics organization. Early stages eCommerce are mostly focused on sales, mostly because of the lack of volume of clients and the constant needs to refine their operational
These metrics are even more relevant for any SaaS product in  which the transactional perspective is less relevant (you might have a limited number of tiers or plans, with little impact in terms of sales variance) therefore the important metrics are tied directly to the user, the amount of time it decides to remain with the product and what takes them to leave the platform.

The dataset

Here’s the dataset. This is a pretty basic purchase dataset, with customer IDs, OrderDate, OrderID. In the real world you might need to do quite some wrangling before getting to a clear table like this (possibly join a user data table with a purchase table for instance). The most relevant characteristic here is that each purchase or order has a “pickup date” associated and a Order Date, which is also related to a specific user ID. With these three pieces of information we can clearly understand what was a user first purchase, if the same user went back purchasing again later, which is essentially what we want to understand. The time of purchase will be the defying dimension of our dataset.

Now we could easily make things a notch more complicated, adding for instance the “channel” dimension, or the “device” dimension (are users from mobile devices coming from paid search campaigns more likely to repeat their purchase etc. etc.?). In that case we should add this new variable to the mix and create a further drill-down dimension to our cohorts, however for the purpose of this example we would stick to the most basic scenario.

The process

  1. Review the dataset
  2. Create additional attribute and necessary feature engineering
  3. Review and visualize the cohorts

The code

Let’s fire up all the relevant libraries. Numpy and Pandas will be required for some wrangling, Seaborn and Matplotlib for visualization. Datetime will be required to work on the time attribute of the dataset

import numpy as np
import pandas as pd
import datetime as dt
import seaborn as sns
import matplotlib.pyplot as plt

Now we need to read the excel file:

df = pd.read_excel('relay_foods.xlsx')

Let’s look at the dataframe:

OrderId OrderDate UserId TotalCharges CommonId PupId PickupDate
0 262 2009-01-11 47 50.67 TRQKD 2 2009-01-12
1 278 2009-01-20 47 26.60 4HH2S 3 2009-01-20
2 294 2009-02-03 47 38.71 3TRDC 2 2009-02-04
3 301 2009-02-06 47 53.38 NGAZJ 2 2009-02-09
4 302 2009-02-06 47 14.28 FFYHD 2 2009-02-09

The OrderDate column shows year, month and day. For the purpose of this exercise, we’ll be creating cohorts based on months. For this reason, we’ll need an additional column based on the OrderDate column, stating only the month:

#process order period date to cohort label

df['OrderPeriod'] = df.OrderDate.apply(lambda x: x.strftime('%Y-%m'))

Let’s look again at our dataframe:

OrderId OrderDate TotalCharges CommonId PupId PickupDate CohortGroup OrderPeriod
47 262 2009-01-11 50.67 TRQKD 2 2009-01-12 2009-01 2009-01
47 278 2009-01-20 26.60 4HH2S 3 2009-01-20 2009-01 2009-01
47 294 2009-02-03 38.71 3TRDC 2 2009-02-04 2009-01 2009-02
47 301 2009-02-06 53.38 NGAZJ 2 2009-02-09 2009-01 2009-02
47 302 2009-02-06 14.28 FFYHD 2 2009-02-09 2009-01 2009-02

With the lambda function above we’ve taken the data from the OrderDate column and transformed into a YYYY-MM string. This is now our CohortGroup, since it allows as to group the data by the month of the purchase.

df['CohortGroup'] = df.groupby(level=0)['OrderDate'].min().apply(lambda x: x.strftime('%Y-%m'))

If we look at our dataframe now, it will be ordered by UserID with the earliest purchase on top with the relevant cohort group:

UserId OrderId OrderDate TotalCharges CommonId PupId PickupDate CohortGroup OrderPeriod
0 47 262 2009-01-11 50.6700 TRQKD 2 2009-01-12 2009-01 2009-01
1 47 278 2009-01-20 26.6000 4HH2S 3 2009-01-20 2009-01 2009-01
2 47 294 2009-02-03 38.7100 3TRDC 2 2009-02-04 2009-01 2009-02
3 47 301 2009-02-06 53.3800 NGAZJ 2 2009-02-09 2009-01 2009-02
4 47 302 2009-02-06 14.2800 FFYHD 2 2009-02-09 2009-01 2009-02
5 47 321 2009-02-17 29.5000 HA5R3 3 2009-02-17 2009-01 2009-02
6 47 333 2009-02-23 18.9100 RSXQG 2 2009-02-23 2009-01 2009-02
7 47 341 2009-03-01 64.4100 KHBDJ 3 2009-03-03 2009-01 2009-02

It would be all good only if the UserID column wouldnt be used as Index. For the purpose of the following steps, we must make sure that the UserID column is usable and not employed as the dataframe index:


Now we can start aggregating all the relevant variables, Users, Orders and TotalCharges. The lines below apply sum to TotalCharges and run a SQL-like count distinct select to OrderIDs and UserIDs:

grouped = df.groupby(['OrderPeriod','CohortGroup'])

# count the unique users, orders, and total charges per Group + Period
cohorts = grouped.agg({'UserId': pd.Series.nunique,
 'OrderId': pd.Series.nunique,
 'TotalCharges': np.sum})

# make the column names more meaningful
cohorts.rename(columns={'UserId': 'TotalUsers',
 'OrderId': 'TotalOrders',
 'TotalCharges': 'TotalValue'}, inplace=True)

Now that we have all our relevant data nicely aggregated, we still need to build the actual reference for the period of the cohort we want to analise:

def cohort_period(df):
 df['cohort_period'] = np.arange(len(df))+1
 return df

cohorts = cohorts.groupby(level=0).apply(cohort_period)

The function above simply reference each cohort to the month of the first purchase.

We’re almost there: now we just need to visualize things the right way. The Pandas unstuck function allows us to pivot to the required level so that we can shape the dataframe the way we need to to plot a headmap:

cohort_group_size = cohorts['TotalUsers'].groupby(level=0).first()
user_retention = cohorts['TotalUsers'].unstack(0).divide(cohort_group_size, axis=1)

And finally, this is the seaborn code to plot the heatmap:

plt.figure(figsize = (11,9))
plt.title('Cohort Analysis - Retention Rate')
sns.heatmap(data = user_retention, 
 annot = True, 
 fmt = '.0%', 
 vmin = 0.0,
 vmax = 0.5,
 cmap = "YlGnBu")

The end result should look something like this:

Ecommerce Analytics Playbook – Cohort Analysis