When to use cohort analysis
When we want to decide how fast a group of elements move from one stage to another, and how this speed changes over time.
e.g. How long does it take from a user submits an application to their first transaction? How does this speed change over time?
Why calculating the average time difference between two stages are very biased
If we calculate the average time difference, this means we are only considering the data points that already have the timestamps for both events, and we cannot consider any data points that have not reached the second stage yet. This means more recent data points will have a much shorter average time, but this reduce in average time does not mean an improvement in speed.
If it will take 30 days for most of applications submitted yesterday to be processed, but one application was processed, the average processing time is 1.
How cohort analysis solves this problem
Instead of calculating the average time, cohort analysis calculates the change of percentage of successful data points over time within each cohort. i.e. instead of having only one controlled variable (start time/first stage time) and measure the finish time/ second stage time, we now control both start time and finish tme, and measure success rate.
How to do cohort analysis
There are four columns needed for a cohort analysis:
- Cohort date (time of application submission time grouped by the start of the month)
- Period (number of periods that has passed since cohort date)
- Total cohort size (number of data points in the cohort, number of applications submitted in the cohort month)
- Count Satisfying Target (number of applications submitted during [cohort date] month that started transacting when [period] number of months have passed)
Things to take notice
- You can use
with
clause to prepare the cohort date (start of period) and cohort size - Be sure not to include the data points that do not have a second stage date, e.g. with
inner join
1 |
Variation
I varied my cohort analysis to be “cumulative”, i.e. for count satisfying target, instead of number of applications submitted during [cohort date] month that started transacting when [period] number of months have passed, I do number of applications submitted during [cohort date] month that have started transacting when [period] number of months have passed.
Reference:https://redash.io/help/user-guide/visualizations/cohort-howto