In the past few weeks I’ve written a series of articles on practical uses of Python for specific eCommerce use cases. Consider it a basic toolkit of concepts and methodologies for analysts to quickly answer the most relevant questions around an eCommerce business, from overall performance to active/repeat buyers, profitability, lifetime value or basket composition. These should be essential for any transactional business on the internet.
Please note this is still a work-in-progress. I’m consolidating years of notes and smaller projects, so the list below is likely to change:
- Cohort Analysis
- eCommerce Exploratory Data Analysis
- Association Analysis
- Customer Segmentation
- RFM Analysis
- Customer Lifetime Value
- TBD
…ok, but WHY??
- Examples are usually extremely simplified versions of real problems. While this is great for beginners, reality is often far less forgiving: the data available in most companies is typically incomplete and requires significant cleaning and feature engineering. The transition from class materials to the real world can be jarring. The closer to a real-world scenario we train and learn, the less intimidated we’re going to be.
- Second, and to me the most relevant: there is very little information available specifically on commerce analytics. Most courses provide general knowledge of methods and techniques, but hardly any specific use cases for commerce or marketplaces. And most of these courses focus on rather aspirational goals while totally overlooking the basics (e.g. active/repeat buyers, CLV, basket analysis…). Most students finish those courses with a pretty decent understanding of statistics but can hardly tell which tool is best for which problem.
…so who is this for??
I’d rather say who this is NOT for:
- If you’re an accomplished data scientist in the top 0.5% of Kaggle, you’d be better off closing this tab and switching over to reddit. I’m afraid there’s nothing here for you.
- If you’re not a data scientist but know enough Python to build and deploy your own models, you’d also be better off getting on reddit now.
However…
- If you’re a marketer or a business lead with a pretty decent technical background and some coding experience, you could enjoy this.
- If you’re a business analyst with strong SQL skills and decent business acumen, this could be a decent (and mostly immediately actionable) intro to Python and data science.
…fine, I fit in neither of the last two categories but I want to do this. What do I need?
- Some familiarity with Python. Don’t expect to understand each and every line of code, but you should at least have a grasp of the language.
- Python 3.6 installed on your machine (look into this guide for OSX and this other one for Windows if you don’t have it). Jupyter Notebook is also a great tool to build and test stuff locally.
- Some business understanding (e.g. you must know how an eCommerce business works, but that should go without saying).
OK, let’s start.
What is Cohort Analysis
Straight from Wikipedia:
Cohort analysis is a subset of behavioral analytics that takes the data from a given data set (e.g. an EMRS, an e-commerce platform, web application, or online game) and rather than looking at all users as one unit, it breaks them into related groups for analysis. These related groups, or cohorts, usually share common characteristics or experiences within a defined time-span.
What this means is that we’ll be slicing our dataset into different groups sharing the same characteristics, and we’ll try to understand how differently these groups behave through time.
Always keep in mind that cohorts are mutually exclusive segments; they let you compare metrics across the product lifecycle as well as across the customer lifecycle. Essentially you can have time cohorts, behavior cohorts and size cohorts. For this example we will be looking at behaviour through the customer lifecycle, measuring the time between the first and the following purchases.
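As a tiny illustration of a time cohort (the data below is made up for the example), each user gets labelled with the month of their first purchase, so every user falls into exactly one group:

```python
import pandas as pd

# made-up orders: three users, five purchases
orders = pd.DataFrame({
    'UserId': [1, 1, 2, 3, 3],
    'OrderDate': pd.to_datetime(['2009-01-05', '2009-02-10',
                                 '2009-02-01', '2009-01-20', '2009-03-02']),
})

# each user's cohort is the month of their first purchase (mutually exclusive)
cohort = orders.groupby('UserId')['OrderDate'].min().dt.strftime('%Y-%m')
print(cohort.to_dict())  # {1: '2009-01', 2: '2009-02', 3: '2009-01'}
```

This is exactly the kind of slicing we’ll do below, just on a real purchase table.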
Why (eventually) every business needs to do Cohort Analysis
The dataset
Here’s the dataset. This is a pretty basic purchase dataset, with customer IDs (UserId), an OrderDate and an OrderId. In the real world you might need to do quite some wrangling before getting to a clean table like this (for instance, joining a user data table with a purchase table). The most relevant characteristic here is that each purchase or order has a PickupDate and an OrderDate associated, both tied to a specific UserId. With these pieces of information we can clearly understand what a user’s first purchase was and whether the same user went back to purchase again later, which is essentially what we want to understand. The time of purchase will be the defining dimension of our dataset.
Now we could easily make things a notch more complicated by adding, for instance, a “channel” dimension or a “device” dimension (are users on mobile devices coming from paid search campaigns more likely to repeat their purchase, etc.?). In that case we would add the new variable to the mix and create a further drill-down dimension for our cohorts; however, for the purpose of this example we will stick to the most basic scenario.
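As a quick sketch of that richer scenario (the Channel column below is hypothetical and not part of the dataset used in this article), the extra dimension would simply join the groupby keys:

```python
import pandas as pd

# made-up orders with an acquisition channel attached to each purchase
toy = pd.DataFrame({
    'UserId': [1, 1, 2, 2, 3],
    'Channel': ['paid', 'paid', 'organic', 'organic', 'paid'],
    'OrderPeriod': ['2009-01', '2009-02', '2009-01', '2009-01', '2009-02'],
})

# unique buyers per channel and period: one extra drill-down level for the cohorts
buyers = toy.groupby(['Channel', 'OrderPeriod'])['UserId'].nunique()
print(buyers)
```

Keep in mind that every additional key multiplies the number of cohorts, so each one gets thinner and the retention estimates noisier on small datasets.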
The process
- Review the dataset
- Create additional attributes and do the necessary feature engineering
- Review and visualize the cohorts
The code
Let’s fire up all the relevant libraries: NumPy and Pandas will be required for some wrangling, Seaborn and Matplotlib for visualization, and datetime to work on the time attributes of the dataset.
import numpy as np
import pandas as pd
import datetime as dt
import seaborn as sns
import matplotlib.pyplot as plt
Now we need to read the Excel file:
df = pd.read_excel('relay_foods.xlsx')
df.head()
Let’s look at the dataframe:
| | OrderId | OrderDate | UserId | TotalCharges | CommonId | PupId | PickupDate |
|---|---|---|---|---|---|---|---|
| 0 | 262 | 2009-01-11 | 47 | 50.67 | TRQKD | 2 | 2009-01-12 |
| 1 | 278 | 2009-01-20 | 47 | 26.60 | 4HH2S | 3 | 2009-01-20 |
| 2 | 294 | 2009-02-03 | 47 | 38.71 | 3TRDC | 2 | 2009-02-04 |
| 3 | 301 | 2009-02-06 | 47 | 53.38 | NGAZJ | 2 | 2009-02-09 |
| 4 | 302 | 2009-02-06 | 47 | 14.28 | FFYHD | 2 | 2009-02-09 |
The OrderDate column shows year, month and day. For the purpose of this exercise, we’ll be creating cohorts based on months. For this reason, we’ll need an additional column derived from the OrderDate column, containing only the year and month:
# process the order date into a YYYY-MM cohort label
df['OrderPeriod'] = df.OrderDate.apply(lambda x: x.strftime('%Y-%m'))
Let’s look again at our dataframe:
| UserId | OrderId | OrderDate | TotalCharges | CommonId | PupId | PickupDate | CohortGroup | OrderPeriod |
|---|---|---|---|---|---|---|---|---|
| 47 | 262 | 2009-01-11 | 50.67 | TRQKD | 2 | 2009-01-12 | 2009-01 | 2009-01 |
| 47 | 278 | 2009-01-20 | 26.60 | 4HH2S | 3 | 2009-01-20 | 2009-01 | 2009-01 |
| 47 | 294 | 2009-02-03 | 38.71 | 3TRDC | 2 | 2009-02-04 | 2009-01 | 2009-02 |
| 47 | 301 | 2009-02-06 | 53.38 | NGAZJ | 2 | 2009-02-09 | 2009-01 | 2009-02 |
| 47 | 302 | 2009-02-06 | 14.28 | FFYHD | 2 | 2009-02-09 | 2009-01 | 2009-02 |
With the lambda function above we’ve taken the data from the OrderDate column and transformed it into a YYYY-MM string. Next we build the CohortGroup, the month of each user’s first purchase, which allows us to group the data by when each user first bought:
# index by user so we can compute each user's first purchase month
df.set_index('UserId', inplace=True)
df['CohortGroup'] = df.groupby(level=0)['OrderDate'].min().apply(lambda x: x.strftime('%Y-%m'))
If we look at our dataframe now, it will be ordered by UserId with the earliest purchase on top, together with the relevant cohort group:
| | UserId | OrderId | OrderDate | TotalCharges | CommonId | PupId | PickupDate | CohortGroup | OrderPeriod |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 47 | 262 | 2009-01-11 | 50.6700 | TRQKD | 2 | 2009-01-12 | 2009-01 | 2009-01 |
| 1 | 47 | 278 | 2009-01-20 | 26.6000 | 4HH2S | 3 | 2009-01-20 | 2009-01 | 2009-01 |
| 2 | 47 | 294 | 2009-02-03 | 38.7100 | 3TRDC | 2 | 2009-02-04 | 2009-01 | 2009-02 |
| 3 | 47 | 301 | 2009-02-06 | 53.3800 | NGAZJ | 2 | 2009-02-09 | 2009-01 | 2009-02 |
| 4 | 47 | 302 | 2009-02-06 | 14.2800 | FFYHD | 2 | 2009-02-09 | 2009-01 | 2009-02 |
| 5 | 47 | 321 | 2009-02-17 | 29.5000 | HA5R3 | 3 | 2009-02-17 | 2009-01 | 2009-02 |
| 6 | 47 | 333 | 2009-02-23 | 18.9100 | RSXQG | 2 | 2009-02-23 | 2009-01 | 2009-02 |
| 7 | 47 | 341 | 2009-03-01 | 64.4100 | KHBDJ | 3 | 2009-03-03 | 2009-01 | 2009-03 |
It would all be fine if the UserId column weren’t being used as the index. For the purpose of the following steps, we must make sure that UserId is a usable column and not the dataframe index:
df.reset_index(inplace=True)
Now we can start aggregating all the relevant variables: users, orders and TotalCharges. The lines below apply a sum to TotalCharges and run a SQL-like COUNT DISTINCT on OrderId and UserId:
# group by cohort first, so level 0 of the resulting index is the CohortGroup
grouped = df.groupby(['CohortGroup', 'OrderPeriod'])

# count the unique users, orders, and total charges per Group + Period
cohorts = grouped.agg({'UserId': pd.Series.nunique,
                       'OrderId': pd.Series.nunique,
                       'TotalCharges': np.sum})

# make the column names more meaningful
cohorts.rename(columns={'UserId': 'TotalUsers',
                        'OrderId': 'TotalOrders',
                        'TotalCharges': 'TotalValue'}, inplace=True)
cohorts.head()
Now that we have all our relevant data nicely aggregated, we still need to build the actual reference for the period of each cohort we want to analyze:
def cohort_period(df):
    # number each cohort's periods: 1 = the cohort's first month, 2 = the second, etc.
    df['cohort_period'] = np.arange(len(df)) + 1
    return df

cohorts = cohorts.groupby(level=0).apply(cohort_period)
cohorts.head()
The function above numbers each cohort’s periods relative to the month of its first purchase: period 1 is the cohort’s first month, period 2 the following one, and so on.
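To make the numbering concrete, here’s a minimal sketch with made-up aggregates (two cohorts, observed over three and two periods respectively):

```python
import numpy as np
import pandas as pd

def cohort_period(df):
    # number each group's rows 1, 2, 3, ... in order
    df['cohort_period'] = np.arange(len(df)) + 1
    return df

toy = pd.DataFrame({
    'CohortGroup': ['2009-01', '2009-01', '2009-01', '2009-02', '2009-02'],
    'OrderPeriod': ['2009-01', '2009-02', '2009-03', '2009-02', '2009-03'],
    'TotalUsers': [22, 8, 10, 15, 5],
}).set_index(['CohortGroup', 'OrderPeriod'])

toy = toy.groupby(level=0).apply(cohort_period)
print(toy['cohort_period'].tolist())  # [1, 2, 3, 1, 2]
```

Note how the 2009-02 cohort restarts at period 1: the numbering is relative to each cohort’s own start, not to the calendar.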
We’re almost there: now we just need to visualize things the right way. The Pandas unstack function allows us to pivot the required index level so that we can shape the dataframe the way we need to plot a heatmap:
# re-index by cohort and relative period so retention lines up across cohorts
cohorts.reset_index(inplace=True)
cohorts.set_index(['CohortGroup', 'cohort_period'], inplace=True)

cohort_group_size = cohorts['TotalUsers'].groupby(level=0).first()
user_retention = cohorts['TotalUsers'].unstack(0).divide(cohort_group_size, axis=1)
user_retention
And finally, this is the seaborn code to plot the heatmap:
plt.figure(figsize=(11, 9))
plt.title('Cohort Analysis - Retention Rate')
sns.heatmap(data=user_retention, annot=True, fmt='.0%', vmin=0.0, vmax=0.5, cmap='YlGnBu')
plt.show()
The end result should look something like this: