One of the questions that arise when we store event level log for analysis is for how long do we save history? I wanted to identify the optimal time required to keep a data in the database.
In statistics, especially in Biostatistics there is a technique called Survival analysis, which done to understand how long a patient lives after certain treatment is given. It is also done to compare between different treatments. You can read more about Survival analysis in https://en.wikipedia.org/wiki/Survival_analysis. There are many models developed on Survival statistics out of which two most widely used models are Kaplan-Meier and Cox Regression.
In my analysis I have used Kaplan-Meier since my only goal here is to find the optimal time to keep the history. Cox regression is used to find more about the causation of the event itself, it helps us in explaining what variables contribute to an event. We will use the hazard function also known as survival function (https://en.wikipedia.org/wiki/Survival_function ) to come up the rate at which the users are dropping out. We will use Kaplan-Meier estimator for this (https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator ).
Here I’m taking last 6 months of traffic data to get all the COOKIE ID’s and their event date. I’m making an assumption that any COOKIE ID which is inactive for 30 days is considered as churned and an event is recorded with its last active date, along with that the logic for censoring is – no event by end of fixed study period which is 6 months.
SELECT a.cookie_id , a.last_active_date , a.first_seen_date , datediff ( 1 to_date (a.last_active_date) ,to_date (a.first_seen_date) ) + 1 AS total_time , CASE WHEN datediff ( to_date ('2019-05-29') ,to_date (a.last_active_date) ) > 30 THEN 1 ELSE 0 END AS event FROM ( SELECT cookie_id , MIN( event_date ) AS first_seen_date , MAX( event_date ) AS last_active_date FROM <event_data>WHERE event_date > '2018-12-01' GROUP BY cookie_id) a;
Kaplan-MeierEstimator
%matplotlib inline
from lifelines import KaplanMeierFitter
## create a kmf object
kmf = KaplanMeierFitter()
import pandas as pd
draft_df = pd.read_csv("set.csv",names=("total_time","event"))
draft_df.head()
total_time event 0 1 1 1 1 1 2 1 1 3 1 1 4 1 0
kmf.fit(durations = draft_df.total_time,
event_observed = draft_df.event)
kmf.plot(ci_show=False)

kmf.survival_function_
KM_estimate timeline 0.0 1.000000 1.0 0.333226 2.0 0.310522 ... ... 21.0 0.231396 22.0 0.228262 23.0 0.225422 24.0 0.222873 25.0 0.220386 26.0 0.217967 27.0 0.215404 28.0 0.212746 29.0 0.210033 ... ... 150.0 0.072630 151.0 0.072630 152.0 0.072630 153.0 0.072630 154.0 0.072630 ... ... 177.0 0.072630 178.0 0.072630 179.0 0.072630
In the data and the plot above we can see that almost 67% of the COOKIE ID’s drop out in one day. And the rate of dropping out slows after day 1, but almost 75% of the COOKIE ID’s have dropped out by day 17 and 80% dropped out by day 30.
Below is the famous Kaplan-Meier event table/ life table for those who are interested
kmf.event_table
removed observed censored entrance at_risk event_at 0.0 0 0 0 1000000 1000000 1.0 810798 666774 144024 0 1000000 2.0 16019 12891 3128 0 189202 3.0 6760 5316 1444 0 173183 4.0 4994 3926 1068 0 166423 5.0 4167 3201 966 0 161429 6.0 3782 2915 867 0 157262 7.0 3422 2606 816 0 153480 8.0 3524 2672 852 0 150058 9.0 2953 2277 676 0 146534 10.0 2539 1949 590 0 143581 11.0 2380 1864 516 0 141042 12.0 2297 1743 554 0 138662 13.0 2278 1701 577 0 136365 14.0 2400 1786 614 0 134087 15.0 2378 1736 642 0 131687 16.0 2189 1610 579 0 129309 17.0 1881 1450 431 0 127120 18.0 1847 1414 433 0 125239 19.0 1815 1405 410 0 123392 20.0 1850 1387 463 0 121577 21.0 1960 1481 479 0 119727 22.0 2116 1595 521 0 117767 23.0 1932 1439 493 0 115651 24.0 1705 1286 419 0 113719 25.0 1664 1250 414 0 112014 ........
Conclution:
Based on the available data we can recommend any date between 17 to 30, but it is up to the business to set the date based on how many users they would like to see at a given time.