Dineshkumar Murugan

One of the questions that arise when we store event level log for analysis is for how long do we save history? I wanted to identify the optimal time required to keep a data in the database.

In statistics, especially in Biostatistics there is a technique called Survival analysis, which done to understand how long a patient lives after certain treatment is given. It is also done to compare between different treatments. You can read more about Survival analysis in https://en.wikipedia.org/wiki/Survival_analysis. There are many models developed on Survival statistics out of which two most widely used models are Kaplan-Meier and Cox Regression.

In my analysis I have used Kaplan-Meier since my only goal here is to find the optimal time to keep the history. Cox regression is used to find more about the causation of the event itself, it helps us in explaining what variables contribute to an event. We will use the hazard function also known as survival function (https://en.wikipedia.org/wiki/Survival_function ) to come up the rate at which the users are dropping out. We will use Kaplan-Meier estimator for this (https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator ).

Here I’m taking last 6 months of traffic data to get all the COOKIE ID’s and their event date. I’m making an assumption that any COOKIE ID which is inactive for 30 days is considered as churned and an event is recorded with its last active date, along with that the logic for censoring is – no event by end of fixed study period which is 6 months.

        SELECT a.cookie_id ,
               a.last_active_date ,
               a.first_seen_date ,
               datediff ( 1 to_date (a.last_active_date) ,to_date (a.first_seen_date) ) + 1 AS total_time ,
               CASE WHEN datediff ( to_date ('2019-05-29') ,to_date (a.last_active_date) ) > 30 THEN 
                     1 ELSE 0 END AS event 
               FROM 
                     ( SELECT cookie_id ,
                     MIN( event_date ) AS first_seen_date ,
                     MAX( event_date ) AS last_active_date 
                     FROM <event_data>
                          WHERE event_date > '2018-12-01' GROUP BY cookie_id ) a;

Kaplan-MeierEstimator

%matplotlib inline
from lifelines import KaplanMeierFitter
## create a kmf object
kmf = KaplanMeierFitter()

import pandas as pd

draft_df = pd.read_csv("set.csv",names=("total_time","event"))
draft_df.head()

total_time event
0 1 1
1 1 1
2 1 1
3 1 1
4 1 0

kmf.fit(durations = draft_df.total_time,
event_observed = draft_df.event)

kmf.plot(ci_show=False)

kmf.survival_function_

KM_estimate
timeline
0.0 1.000000
1.0 0.333226
2.0 0.310522
... ...
21.0 0.231396
22.0 0.228262
23.0 0.225422
24.0 0.222873
25.0 0.220386
26.0 0.217967
27.0 0.215404
28.0 0.212746
29.0 0.210033
... ...
150.0 0.072630
151.0 0.072630
152.0 0.072630
153.0 0.072630
154.0 0.072630
... ...
177.0 0.072630
178.0 0.072630
179.0 0.072630

In the data and the plot above we can see that almost 67% of the COOKIE ID’s drop out in one day. And the rate of dropping out slows after day 1, but almost 75% of the COOKIE ID’s have dropped out by day 17 and 80% dropped out by day 30.
Below is the famous Kaplan-Meier event table/ life table for those who are interested

kmf.event_table

removed observed censored entrance at_risk
event_at
0.0 0 0 0 1000000 1000000
1.0 810798 666774 144024 0 1000000
2.0 16019 12891 3128 0 189202
3.0 6760 5316 1444 0 173183
4.0 4994 3926 1068 0 166423
5.0 4167 3201 966 0 161429
6.0 3782 2915 867 0 157262
7.0 3422 2606 816 0 153480
8.0 3524 2672 852 0 150058
9.0 2953 2277 676 0 146534
10.0 2539 1949 590 0 143581
11.0 2380 1864 516 0 141042
12.0 2297 1743 554 0 138662
13.0 2278 1701 577 0 136365
14.0 2400 1786 614 0 134087
15.0 2378 1736 642 0 131687
16.0 2189 1610 579 0 129309
17.0 1881 1450 431 0 127120
18.0 1847 1414 433 0 125239
19.0 1815 1405 410 0 123392
20.0 1850 1387 463 0 121577
21.0 1960 1481 479 0 119727
22.0 2116 1595 521 0 117767
23.0 1932 1439 493 0 115651
24.0 1705 1286 419 0 113719
25.0 1664 1250 414 0 112014
........

Conclution:

Based on the available data we can recommend any date between 17 to 30, but it is up to the business to set the date based on how many users they would like to see at a given time.