Ecommerce Analytics Playbook – Cohort Analysis

This article describes how to run a cohort analysis in Python. It is also part of something more extended I've always wanted to work on: in the past few weeks I've written a series of articles on the practical use of Python for specific eCommerce use cases. Consider it a basic toolkit of concepts and methodologies that analysts can use to quickly answer the most relevant questions about an eCommerce business, from overall performance to active/repeat buyers, profitability, lifetime value or basket composition. These should be the essentials for any transactional business on the internet.

Please note this is still a work-in-progress. I’m consolidating years of notes and smaller projects, so the list below is likely to change:

  1. Cohort Analysis
  2. eCommerce Exploratory Data Analysis
  3. Association Analysis
  4. Customer Segmentation
  5. RFM Analysis
  6. Customer Lifetime Value
  7. TBD

…ok, but WHY??

Most machine learning and data analytics courses provide a solid foundation in methodology and overall statistical knowledge. This is clearly necessary, and indeed it should be the core of any course; however, most of them show significant weaknesses in two main areas:
  1. Examples are usually extremely simplified versions of real problems. While this is great for beginners, reality is often far less forgiving: the data available in most companies is frequently incomplete and requires significant cleaning and feature engineering, so the transition from class materials to the real world can be jarring. The closer to a real-world scenario we train and learn, the less intimidated we're going to be.
  2. Second, and to me the most relevant: there is very little information available on commerce analytics specifically. Most courses provide general knowledge of methods and techniques, but hardly any specific use cases for commerce or marketplaces, and many focus on aspirational goals while totally overlooking the basics (e.g. active/repeat buyers, CLV, basket analysis…). Most students finish those courses with a fairly decent understanding of statistics but can hardly tell which tool is best for which problem.
The goal here is to provide the reader with enough knowledge to solve real day-to-day problems in any commerce or transactional environment using Python and machine learning libraries, clearly relating each use case to the right tool and eventually having a positive impact on the company's bottom line.

…so who is this for??

I'd rather say who this is NOT for:

  • If you're an accomplished data scientist in the top 0.5% on Kaggle, you'd be better off closing this tab and switching over to Reddit. I'm afraid there's nothing new for you here.
  • If you're not a data scientist but know enough Python to build and deploy your own models, you'd also be better off getting on Reddit now.

  • If you're a marketer or a business lead with a pretty decent technical background and some coding experience, you could enjoy this.
  • If you're a business analyst with strong SQL skills and decent business acumen, this could be a decent (and mostly immediately actionable) intro to Python and data science.

…fine, I'm in neither of the last two categories, but I still want to do this. What do I need?

  • Some familiarity with Python. Don't expect to understand each and every line of code, but at least have a grasp of the language.
  • Python 3.6 installed on your machine (look into this guide for OSX and this other one for Windows if you don't have it). Jupyter Notebook is also a great tool to build and test stuff locally.
  • Some business understanding (e.g. you must know how an eCommerce business works, but that should go without saying).

OK, let’s start.

What is Cohort Analysis

Straight up from wikipedia:

Cohort analysis is a subset of behavioral analytics that takes the data from a given data set (e.g. an EMRS, an e-commerce platform, web application, or online game) and rather than looking at all users as one unit, it breaks them into related groups for analysis. These related groups, or cohorts, usually share common characteristics or experiences within a defined time-span.

What this means is that we'll slice our dataset into groups that share the same characteristics and try to understand how differently these groups behave over time.

Always keep in mind that cohorts are mutually exclusive segments used to compare metrics across both the product lifecycle and the customer lifecycle. Essentially, you can have time cohorts, behavior cohorts and size cohorts. For this example we will look at behaviour through the customer lifecycle, measuring the time between the first and subsequent purchases.

Why (eventually) every business needs to do Cohort Analysis

eCommerce companies often struggle to switch from a transactional, sales-oriented perspective to a customer-based one. A last-click, sales-oriented view of the business is one that prioritizes gross revenue, average order value and sales, and often neglects active buyers, repeat buyers, and new or reactivated buyers. These metrics are all complementary to each other.
This mindset has a lot to do with the maturity level of an analytics organization. Early-stage eCommerce companies are mostly focused on sales, largely because of their limited customer volume and the constant need to refine their operations.
These metrics are even more relevant for any SaaS product, in which the transactional perspective is less relevant (you might have a limited number of tiers or plans, with little impact in terms of sales variance), so the important metrics are tied directly to the user: how long they decide to remain with the product and what makes them leave the platform.

The dataset

Here's the dataset. This is a pretty basic purchase dataset, with customer IDs, order dates and order IDs. In the real world you might need to do quite some wrangling before getting to a clean table like this (for instance, joining a user table with a purchase table). The most relevant characteristic here is that each purchase has an OrderDate (plus a pickup date) and is tied to a specific UserId. With these three pieces of information (UserId, OrderDate, OrderId) we can clearly work out each user's first purchase and whether the same user came back to purchase again later, which is essentially what we want to understand. The time of purchase will be the defining dimension of our dataset.

Now we could easily make things a notch more complicated by adding, for instance, a "channel" dimension or a "device" dimension (are users on mobile devices coming from paid search campaigns more likely to repeat their purchase, etc.?). In that case we would add the new variable to the mix and create a further drill-down dimension for our cohorts; however, for the purpose of this example we'll stick to the most basic scenario.

The process

  1. Review the dataset
  2. Create additional attributes and do the necessary feature engineering
  3. Review and visualize the cohorts

The code

Let's fire up all the relevant libraries. NumPy and pandas will be required for some wrangling, seaborn and matplotlib for visualization, and datetime to work with the time attributes of the dataset.

import numpy as np
import pandas as pd
import datetime as dt
import seaborn as sns
import matplotlib.pyplot as plt

Now we need to read the excel file:

df = pd.read_excel('relay_foods.xlsx')

Let’s look at the dataframe:

OrderId OrderDate UserId TotalCharges CommonId PupId PickupDate
0 262 2009-01-11 47 50.67 TRQKD 2 2009-01-12
1 278 2009-01-20 47 26.60 4HH2S 3 2009-01-20
2 294 2009-02-03 47 38.71 3TRDC 2 2009-02-04
3 301 2009-02-06 47 53.38 NGAZJ 2 2009-02-09
4 302 2009-02-06 47 14.28 FFYHD 2 2009-02-09

The OrderDate column shows year, month and day. For the purpose of this exercise, we'll be creating cohorts based on months. For this reason, we'll need an additional column, derived from OrderDate, containing only the year and month:

#process order period date to cohort label

df['OrderPeriod'] = df.OrderDate.apply(lambda x: x.strftime('%Y-%m'))

Let’s look again at our dataframe:

OrderId OrderDate UserId TotalCharges CommonId PupId PickupDate OrderPeriod
0 262 2009-01-11 47 50.67 TRQKD 2 2009-01-12 2009-01
1 278 2009-01-20 47 26.60 4HH2S 3 2009-01-20 2009-01
2 294 2009-02-03 47 38.71 3TRDC 2 2009-02-04 2009-02
3 301 2009-02-06 47 53.38 NGAZJ 2 2009-02-09 2009-02
4 302 2009-02-06 47 14.28 FFYHD 2 2009-02-09 2009-02

With the lambda function above we've taken the data from the OrderDate column and transformed it into a YYYY-MM string. This gives us the basis for our CohortGroup, since it allows us to group the data by the month of purchase. The groupby below operates on the dataframe index (level=0), so we first need to set UserId as the index; each user's CohortGroup is then the month of their earliest purchase:

df = df.set_index('UserId')
df['CohortGroup'] = df.groupby(level=0)['OrderDate'].min().apply(lambda x: x.strftime('%Y-%m'))

If we look at our dataframe now, it will be ordered by UserId with the earliest purchases on top, together with the relevant cohort group:

UserId OrderId OrderDate TotalCharges CommonId PupId PickupDate CohortGroup OrderPeriod
0 47 262 2009-01-11 50.6700 TRQKD 2 2009-01-12 2009-01 2009-01
1 47 278 2009-01-20 26.6000 4HH2S 3 2009-01-20 2009-01 2009-01
2 47 294 2009-02-03 38.7100 3TRDC 2 2009-02-04 2009-01 2009-02
3 47 301 2009-02-06 53.3800 NGAZJ 2 2009-02-09 2009-01 2009-02
4 47 302 2009-02-06 14.2800 FFYHD 2 2009-02-09 2009-01 2009-02
5 47 321 2009-02-17 29.5000 HA5R3 3 2009-02-17 2009-01 2009-02
6 47 333 2009-02-23 18.9100 RSXQG 2 2009-02-23 2009-01 2009-02
7 47 341 2009-03-01 64.4100 KHBDJ 3 2009-03-03 2009-01 2009-02

This would all be fine, except that the UserId column is currently used as the index. For the purpose of the following steps, we must make sure that UserId is usable as a regular column and not employed as the dataframe index, so we reset it:

df.reset_index(inplace=True)
Now we can start aggregating all the relevant variables: users, orders and TotalCharges. The lines below apply a sum to TotalCharges and run a SQL-like COUNT DISTINCT on OrderId and UserId. Note that CohortGroup comes first in the groupby, so the cohort is the first level of the resulting index:

grouped = df.groupby(['CohortGroup', 'OrderPeriod'])

# count the unique users, orders, and total charges per Group + Period
cohorts = grouped.agg({'UserId': pd.Series.nunique,
 'OrderId': pd.Series.nunique,
 'TotalCharges': np.sum})

# make the column names more meaningful
cohorts.rename(columns={'UserId': 'TotalUsers',
 'OrderId': 'TotalOrders',
 'TotalCharges': 'TotalValue'}, inplace=True)

Now that we have all our relevant data nicely aggregated, we still need to build the actual period reference for each cohort we want to analyse:

def cohort_period(df):
 df['cohort_period'] = np.arange(len(df)) + 1
 return df

cohorts = cohorts.groupby(level=0).apply(cohort_period)

# re-index by cohort and period so we can pivot on cohort_period below
cohorts.reset_index(inplace=True)
cohorts.set_index(['CohortGroup', 'cohort_period'], inplace=True)

The function above simply numbers each cohort's monthly periods sequentially: period 1 is the month of the cohort's first purchase, period 2 the following month, and so on.

We're almost there: now we just need to visualize things the right way. The pandas unstack function allows us to pivot the cohort level into columns so that we can shape the dataframe the way we need to plot a heatmap:

cohort_group_size = cohorts['TotalUsers'].groupby(level=0).first()
user_retention = cohorts['TotalUsers'].unstack(0).divide(cohort_group_size, axis=1)
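To make the pivot concrete, here is a minimal, self-contained sketch with made-up numbers, assuming the cohorts frame is indexed by CohortGroup and cohort_period as above:

```python
import pandas as pd

# Hypothetical aggregate: two cohorts, indexed by (CohortGroup, cohort_period)
cohorts = pd.DataFrame(
    {'TotalUsers': [10, 6, 4, 8, 2]},
    index=pd.MultiIndex.from_tuples(
        [('2009-01', 1), ('2009-01', 2), ('2009-01', 3),
         ('2009-02', 1), ('2009-02', 2)],
        names=['CohortGroup', 'cohort_period']))

# size of each cohort = users in its first period
cohort_group_size = cohorts['TotalUsers'].groupby(level=0).first()

# pivot cohorts into columns and divide by cohort size to get retention rates
user_retention = cohorts['TotalUsers'].unstack(0).divide(cohort_group_size, axis=1)

print(user_retention)
```

Each column is a cohort, each row a period; the first row is always 100%, and later rows show the share of the cohort still purchasing (e.g. 60% of the 2009-01 cohort returns in period 2).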

And finally, this is the seaborn code to plot the heatmap:

plt.figure(figsize = (11,9))
plt.title('Cohort Analysis - Retention Rate')
sns.heatmap(data = user_retention, 
 annot = True, 
 fmt = '.0%', 
 vmin = 0.0,
 vmax = 0.5,
 cmap = "YlGnBu")

The end result should look something like this:

(figure: retention-rate heatmap by cohort and period)

CRO and Pricing Optimization

This post previously appeared on and is also available on the EIM Blog.

In the world of eCommerce and online marketing, conversion rate optimization (CRO) is used to ensure companies maximize their revenue by providing the best experiences to consumers. It's a blend of user experience design, data analysis, and strict experimental methodology to test which kind of online experience and product pricing consumers respond to most favorably and ultimately convert on. In this article, envisionit's CRO expert, Fabio Fanni, explains some of the methodology we use for our clients to ensure they're getting the most out of their spending to reach their goals, whether that's a larger user base, higher profits, or whatever their target may be.

A/B and Multivariate Testing

A/B testing, or split testing, is a process where we run a simultaneous experiment between two pages to see which performs better. With A/B testing we validate or reject changes and modifications depending on their performance and how they affect our KPIs.
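The "which performs better" step usually comes down to a significance test on the two conversion rates. As a sketch (a plain two-proportion z-test with made-up figures, not any specific testing platform's method):

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# hypothetical numbers: variant B converts at 2.6% vs 2.0% for A
z, p = two_proportion_z_test(200, 10_000, 260, 10_000)
```

With these illustrative numbers the difference is significant at the usual 5% level; with smaller samples the same lift might not be.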

We usually hear about A/B and multivariate testing in UX design, user engagement and conversion optimization. These are some of the most common contexts of use for these tools.

However, for those cases in which there is no customer relationship management (CRM) backing up the pricing strategy, we can actually use these tools to find out the right price range for products yielding the highest revenue per click and profit margin, thus helping us to maximize ROI and advertisement spending.

With this method we are able to find out if a lower price drives more traffic or a higher price reduces conversions but delivers a higher profit for sale. This can be extremely helpful for those products whose pricing can be hard to define (online/digital SaaS) or industries in which the acquisition costs significantly affect net profits (travel, ecommerce retail etc.).

(figure: pricing optimization with A/B testing)

This chart shows revenue per visitor on the y-axis and the actual product price on the x-axis. In this example, we tested different prices in $2 increments and tracked revenue per visitor. Because a lower price generates a higher conversion rate, revenue per visitor increased up to $3.10 at an $11 price.

Despite the fact that the lowest prices increased the conversion rate and brought in more clients, the improvement was actually offset by the lower price and eventually affected the profit per user. With this method, it is fairly easy to find the sweet spot that generates the highest profit, and most importantly, provides valuable insights about price elasticity.
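The trade-off can be sketched with hypothetical numbers (the prices, conversion rates and unit cost below are all made up for illustration, loosely echoing the chart):

```python
# hypothetical test results: conversion rate observed at each price point
prices = [7, 9, 11, 13, 15]
conv_rates = [0.40, 0.33, 0.28, 0.22, 0.15]
unit_cost = 5  # assumed cost per sale

# revenue per visitor = price * conversion rate
revenue_per_visitor = [p * c for p, c in zip(prices, conv_rates)]
# profit per visitor = margin * conversion rate
profit_per_visitor = [(p - unit_cost) * c for p, c in zip(prices, conv_rates)]

best_revenue_price = prices[revenue_per_visitor.index(max(revenue_per_visitor))]
best_profit_price = prices[profit_per_visitor.index(max(profit_per_visitor))]
```

Note that the revenue-maximizing and profit-maximizing prices need not coincide: once acquisition or unit costs matter, the sweet spot for profit can sit at a higher price than the one that maximizes revenue per visitor.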

This process can also be applied with a different objective in mind. Let's say we want to grow our user base up to the break-even limit, regardless of profit: we would choose the price with the highest number of conversions at which profit per user is $0.

Base price and upselling optimization

However far we can go optimizing the price on our landing page, things can get pretty tricky once we consider a base price and an upselling price separately.

In fact, the right price range for your base product might not be the same for the total upselling price. This case is pretty common in those industries that strongly rely on ancillary revenues (travel and accommodation) and generic upselling (banks, insurance). Upsells tend to create bias and skew value perception, eventually affecting price elasticity.

(figure: pricing optimization with A/B and multivariate testing)

As this chart shows, the optimal price range for the base product is not the same as for the total upsold price. In this case, we would need a multivariate test rather than a standard A/B test, running multiple combinations of base and total price.
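Building the variant grid for such a multivariate test is straightforward; a sketch with hypothetical price points:

```python
from itertools import product

# hypothetical price points for the base product and the upsell
base_prices = [9, 11, 13]
upsell_prices = [4, 6]

# every (base, upsell) combination becomes one test variant
variants = list(product(base_prices, upsell_prices))
```

Keep in mind the grid grows multiplicatively: three base prices and two upsell prices already mean six variants, each needing enough traffic to reach significance on its own.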

Things to Consider

Now this looks like a reliable way to make some very important pricing decisions. Even though this methodology is solid, before taking on an extensive pricing optimization effort it’s important to always take into consideration the following:

1. Achieving statistical significance could take too long or require too much traffic.

The difference in performance between the pricing levels could be minimal, which means the amount of traffic required to reach significance would be very high. Always make sure you are comfortable with the significance level you want to achieve and the amount of traffic you need; you may want to estimate beforehand how much traffic and time you'll need, for instance with this tool from Optimizely.
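The standard normal-approximation formula for a two-proportion test gives a rough version of that estimate; a sketch with made-up baseline and lift (not Optimizely's exact calculation):

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p_base, mde, alpha=0.05, power=0.8):
    """Approximate visitors per variant to detect an absolute lift `mde`
    over a baseline conversion rate `p_base` (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for 95%
    z_beta = NormalDist().inv_cdf(power)           # e.g. 0.84 for 80% power
    p_alt = p_base + mde
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# e.g. baseline 2% conversion, wanting to detect a 0.5-point lift
n = sample_size_per_variant(0.02, 0.005)
```

With these illustrative inputs you would need on the order of 14,000 visitors per variant, which makes the "too long or too much traffic" caveat very concrete for low-traffic pages.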

2. Continuously testing different prices could confuse users and eventually backfire.

This is especially true for digital SaaS products, whose purchase journey can have multiple touch points and whose prospective clients might visit the pricing page with some frequency. Narrow the testing down to certain geographical areas in order to avoid this type of problem.

It's important to keep hard data in mind and use strict methodology when evaluating pricing and optimizing conversions, but intuition and creativity play key roles as well – something Fabio and the rest of the envisionit team excel at in order to reach our clients' target customers. Get in touch with us today to learn how we can bring the latest in marketing and creative work to your brand.
