# Lifelines

*lifelines* is an implementation of survival analysis in Python. What
benefits does *lifelines* offer over other survival analysis
implementations?

- built on top of Pandas
- internal plotting methods
- simple and intuitive API (*designed for humans*)
- only does survival analysis (no unnecessary features or second-class implementations)

## Contents:

### Quickstart

#### Installation

Install via `pip` (see its documentation if `pip` is not yet installed on your system):

```
pip install lifelines
```

#### Kaplan-Meier and Nelson-Aalen

Let’s start by importing some data. We need the durations that individuals are observed for, and whether they “died” or not.

```
from lifelines.datasets import load_waltons
df = load_waltons()  # returns a Pandas DataFrame

print(df.head())
"""
    T  E    group
0   6  1  miR-137
1  13  1  miR-137
2  13  1  miR-137
3  13  1  miR-137
4  19  1  miR-137
"""

T = df['T']
E = df['E']
```

`T` is an array of durations, and `E` is either a boolean or binary array representing whether the “death” was observed or not (alternatively, an individual can be censored).

Note

By default, *lifelines* assumes all “deaths” are observed.

```
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
kmf.fit(T, event_observed=E)  # or, more succinctly, kmf.fit(T, E)
```

After calling the `fit` method, we have access to new properties like `survival_function_` and methods like `plot()`. The latter is a wrapper around Pandas’ internal plotting library.

```
kmf.survival_function_
kmf.median_
kmf.plot()
```

##### Multiple groups

```
groups = df['group']
ix = (groups == 'miR-137')
kmf.fit(T[~ix], E[~ix], label='control')
ax = kmf.plot()
kmf.fit(T[ix], E[ix], label='miR-137')
kmf.plot(ax=ax)
```

Similar functionality exists for the `NelsonAalenFitter`:

```
from lifelines import NelsonAalenFitter
naf = NelsonAalenFitter()
naf.fit(T, event_observed=E)
```

but instead of a `survival_function_` being exposed, a `cumulative_hazard_` is.

Note

Similar to scikit-learn, all statistically estimated quantities have an underscore appended to the property name.

#### Getting Data in the Right Format

Often you’ll have data that looks like:

*start_time*, *end_time*

Lifelines has some utility functions to transform this dataset into duration and censorship vectors:

```
from lifelines.utils import datetimes_to_durations
# start_times is a vector of datetime objects
# end_times is a vector of (possibly missing) datetime objects.
T, E = datetimes_to_durations(start_times, end_times, freq='h')
```

Alternatively, perhaps you are interested in viewing the survival table given some durations and censorship vectors.

```
from lifelines.utils import survival_table_from_events
table = survival_table_from_events(T, E)
print(table.head())
"""
          removed  observed  censored  entrance  at_risk
event_at
0               0         0         0       163      163
6               1         1         0         0      163
7               2         1         1         0      162
9               3         3         0         0      160
13              3         3         0         0      157
"""
```
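The counts in such a table can be reproduced with plain pandas. The sketch below (a hypothetical `manual_survival_table` helper, not part of *lifelines*) mirrors what `survival_table_from_events` computes, on toy data:

```python
import pandas as pd

def manual_survival_table(T, E):
    """At each distinct time: count removals (deaths + censorings), observed
    deaths, censorings, and the number at risk just before that time."""
    df = pd.DataFrame({"T": T, "E": E})
    table = df.groupby("T")["E"].agg(removed="count", observed="sum")
    table["censored"] = table["removed"] - table["observed"]
    # at risk at time t = everyone not yet removed strictly before t
    table["at_risk"] = len(df) - table["removed"].cumsum().shift(fill_value=0)
    return table

# toy durations and event flags (0 = censored)
table = manual_survival_table([6, 7, 7, 9, 9, 9], [1, 1, 0, 1, 1, 1])
print(table)
```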

#### Survival Regression

While the above `KaplanMeierFitter` and `NelsonAalenFitter` are useful, they only give us an “average” view of the population. Often we have specific data at the individual level, either continuous or categorical, that we would like to use. For this, we turn to **survival regression**, specifically `AalenAdditiveFitter` and `CoxPHFitter`.

```
from lifelines.datasets import load_regression_dataset
regression_dataset = load_regression_dataset()
regression_dataset.head()
```

The input of the `fit` method’s API in a regression is different. All the data, including durations, censorships, and covariates must be contained in **a Pandas DataFrame** (yes, it must be a DataFrame). The duration column and event-occurred column must be specified in the call to `fit`.

```
from lifelines import CoxPHFitter
# Using Cox Proportional Hazards model
cph = CoxPHFitter()
cph.fit(regression_dataset, 'T', event_col='E')
cph.print_summary()
"""
n=200, number of events=189

       coef  exp(coef)  se(coef)       z       p  lower 0.95  upper 0.95
var1  0.2213     1.2477    0.0743  2.9796  0.0029      0.0757      0.3669  **
var2  0.0509     1.0522    0.0829  0.6139  0.5393     -0.1116      0.2134
var3  0.2186     1.2443    0.0758  2.8836  0.0039      0.0700      0.3672  **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Concordance = 0.580
"""
cph.plot()
```

If we focus on Aalen’s Additive model,

```
# Using Aalen's Additive model
from lifelines import AalenAdditiveFitter
aaf = AalenAdditiveFitter(fit_intercept=False)
aaf.fit(regression_dataset, 'T', event_col='E')
```

Like `CoxPHFitter`, after fitting you’ll have access to properties like `cumulative_hazards_` and methods like `plot`, `predict_cumulative_hazards`, and `predict_survival_function`. The latter two methods require an additional argument of individual covariates:

```
X = regression_dataset.drop(['E', 'T'], axis=1)
aaf.predict_survival_function(X.iloc[10:12]).plot() # get the unique survival functions of two subjects
```

Like the above estimators, there is also a built-in plotting method:

```
aaf.plot()
```

### Introduction to Survival Analysis

#### Applications

Traditionally, survival analysis was developed to measure lifespans of individuals. An actuary or health professional would ask questions like “how long does this population live for?”, and answer it using survival analysis. For example, the population may be a nation’s population (for actuaries), or a population stricken by a disease (in the medical professional’s case). Traditionally, it was sort of a morbid subject.

The analysis can be further applied to not just traditional *births and
deaths*, but any duration. Medical professionals might be interested in
the *time between childbirths*, where a birth in this case is the event
of having a child, and a death is becoming pregnant again (obviously,
we are loose with our definitions of *birth* and *death*)! Another example
is users subscribing to a service: a birth is a user who joins the
service, and a death is when the user leaves the service.

#### Censorship

At the time you want to make inferences about durations, it is possible, even likely, that not all of the death events have occurred yet. For example, a medical professional will not wait 50 years for each individual in the study to pass away before investigating – he or she is interested in the effectiveness of improving lifetimes after only a few years, or possibly months.

The individuals in a population who have not been subject to the death
event are labeled as *right-censored*, i.e.,
we did not (or can not) view the rest of their life history
due to some external circumstances. All the information we have on
these individuals are their current lifetime durations (which is
naturally *less* than their actual lifetimes).

Note

There is also left-censorship, where an individual’s birth event is not seen.

A common mistake data analysts make is choosing to ignore the right-censored individuals. We shall see why this is a mistake next:

Consider a case where the population is actually made up of two subpopulations, \(A\) and \(B\). Population \(A\) has a very small lifespan, say 2 months on average, and population \(B\) enjoys a much larger lifespan, say 12 months on average. We might not know this distinction before hand. At \(t=10\), we wish to investigate the average lifespan. Below is an example of such a situation.

```
import numpy as np
import matplotlib.pyplot as plt

from lifelines.plotting import plot_lifetimes
from numpy.random import uniform, exponential

N = 25
current_time = 10
actual_lifetimes = np.array([[exponential(12), exponential(2)][uniform() < 0.5] for i in range(N)])
observed_lifetimes = np.minimum(actual_lifetimes, current_time)
observed = actual_lifetimes < current_time

plt.xlim(0, 25)
plt.vlines(10, 0, 30, lw=2, linestyles='--')
plt.xlabel("time")
plt.title("Births and deaths of our population, at $t=10$")
plot_lifetimes(observed_lifetimes, event_observed=observed)

print("Observed lifetimes at time %d:\n" % (current_time), observed_lifetimes)
```

```
Observed lifetimes at time 10:
[ 10. 1.1 8. 10. 3.43 0.63 6.28 1.03 2.37 6.17 10.
0.21 2.71 1.25 10. 3.4 0.62 1.94 0.22 7.43 6.16 10.
9.41 10. 10. ]
```

The red lines denote the lifespan of individuals where the death event
has been observed, and the blue lines denote the lifespan of the
right-censored individuals (deaths have not been observed). If we are
asked to estimate the average lifetime of our population, and we naively
decided to *not* include the right-censored individuals, it is clear
that we would be severely underestimating the true average lifespan.

Furthermore, if we instead simply took the mean of *all* observed
lifespans, including the current lifespans of right-censored instances,
we would *still* be underestimating the true average lifespan. Below we
plot the actual lifetimes of all instances (recall we do not see this
information at \(t=10\)).

```
plt.xlim(0,25)
plt.vlines(10, 0, 30, lw=2, linestyles='--')
plot_lifetimes(actual_lifetimes, event_observed=observed)
```

Survival analysis was originally developed to solve this type of problem, that is, to deal with estimation when our data is right-censored. Even in the case where all events have been observed, i.e. no censorship, survival analysis is still a very useful tool to understand durations.

The observations need not always start at zero, either. This was done
only for understanding in the above example. Consider the example where
a customer entering a store is a birth: a customer can enter at
any time, and not necessarily at time zero. In survival analysis, durations
are relative: individuals may start at different times.
(We actually only need the *duration* of the observation, and not
necessarily the start and end times.)

We next introduce the two fundamental objects in survival analysis, the
*survival function* and the *hazard function*.

#### Survival function

Let \(T\) be a (possibly infinite, but always non-negative) random lifetime taken from the population under study. For example, the amount of time a couple is married, or the time it takes a user to enter a webpage (an infinite time if they never do). The survival function, \(S(t)\), of a population is defined as

\[S(t) = \Pr(T > t)\]

In plain English: the survival function defines the probability the death event has not occurred yet at time \(t\), or equivalently, the probability of surviving past time \(t\). Note the following properties of the survival function:

- \(0 \le S(t) \le 1\)
- \(F_T(t) = 1 - S(t)\), where \(F_T(t)\) is the CDF of \(T\), which implies
- \(S(t)\) is a non-increasing function of \(t\).
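These properties can be checked numerically for a simple parametric example. Below we assume an exponential lifetime with a hypothetical constant hazard rate of 0.5 (an illustration, not from any dataset in this document):

```python
import numpy as np

t = np.linspace(0, 10, 101)
lam = 0.5                 # hypothetical constant hazard rate
S = np.exp(-lam * t)      # survival function of an Exponential(lam) lifetime
F = 1 - S                 # the CDF, F_T(t) = 1 - S(t)

assert np.all((0 <= S) & (S <= 1))   # bounded in [0, 1]
assert np.all(np.diff(S) <= 0)       # non-increasing in t
```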

#### Hazard curve

We are also interested in the probability of the death event occurring at time \(t\), given that the death event has not occurred until time \(t\). Mathematically, that is:

\[\Pr(t \le T \le t + \delta t \mid T > t)\]

This quantity goes to 0 as \(\delta t\) shrinks, so we divide this by the interval \(\delta t\) (like we might do in calculus). This defines the hazard function at time \(t\), \(\lambda(t)\):

\[\lambda(t) = \lim_{\delta t \to 0} \frac{\Pr(t \le T \le t + \delta t \mid T > t)}{\delta t}\]

It can be shown with quite elementary probability that this is equal to:

\[\lambda(t) = \frac{-S'(t)}{S(t)}\]

and solving this differential equation (yes, it is a differential equation), we get:

\[S(t) = \exp\left(-\int_0^t \lambda(s)\, ds\right)\]

What I love about the above equation is that it defines **all** survival
functions, and because the hazard function is arbitrary (i.e. there is
no parametric form), the entire function is non-parametric (this allows
for very flexible curves). Notice that we can now speak either about the
survival function, \(S(t)\), or the hazard function,
\(\lambda(t)\), and we can convert back and forth quite easily. It
also gives us another, albeit less useful, expression for the density of \(T\):
upon differentiation and some algebra, we recover:

\[f_T(t) = \lambda(t) \exp\left(-\int_0^t \lambda(s)\, ds\right)\]

Of course, we do not observe the true survival curve of a population. We
must use the observed data to estimate it. We also want to continue to
be non-parametric, that is not assume anything about how the
survival curve looks. The *best* method to recreate the survival
function non-parametrically from the data is known as the Kaplan-Meier
estimate, which brings us to estimation using lifelines.
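Before turning to estimation, the hazard-to-survival identity above can be sanity-checked numerically: integrate a hazard and exponentiate. The constant hazard of 0.3 below is an arbitrary assumption for illustration:

```python
import numpy as np

# Hypothetical constant hazard lambda(t) = 0.3; the identity then gives
# S(t) = exp(-0.3 t) exactly.
t = np.linspace(0, 5, 501)
hazard = np.full_like(t, 0.3)

# Cumulative hazard via trapezoidal integration, then S(t) = exp(-Lambda(t)).
increments = (hazard[1:] + hazard[:-1]) / 2 * np.diff(t)
cum_hazard = np.concatenate([[0.0], np.cumsum(increments)])
S = np.exp(-cum_hazard)

print(np.max(np.abs(S - np.exp(-0.3 * t))))  # numerical error, essentially zero
```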

### Survival analysis with lifelines

In the previous section,
we introduced the use of survival analysis, the need, and the
mathematical objects on which it relies. In this article, we will work
with real data and the *lifelines* library to estimate these mathematical objects.

#### Estimating the Survival function using Kaplan-Meier

For this example, we will be investigating the lifetimes of political leaders around the world. A political leader, in this case, is defined by an individual’s time in office while controlling the ruling regime. This political leader could be an elected president, unelected dictator, monarch, etc. The birth event is the start of the individual’s tenure, and the death event is the retirement of the individual. Censorship can occur if they are a) still in office at the time of dataset compilation (2008), or b) die while in power (this includes assassinations).

For example, the Bush regime began in 2000 and officially ended in 2008
upon his retirement, thus this regime’s lifespan was eight years, and there was a
“death” event observed. On the other hand, the JFK regime lasted two
years, from 1961 to 1963, and the regime’s official death event *was
not* observed – JFK died before his official retirement.

(This is an example that has gladly redefined the birth and death events, and in fact completely flips the idea upside down by using deaths as the censorship event. This is also an example where the current time is not the only cause of censorship; there are alternative events (e.g., death in office) that can be the cause of censorship.)

To estimate the survival function, we first will use the Kaplan-Meier Estimate, defined:

\[\hat{S}(t) = \prod_{t_i \le t} \frac{n_i - d_i}{n_i}\]

where \(d_i\) is the number of death events at time \(t_i\) and \(n_i\) is the number of subjects at risk of death just prior to time \(t_i\).
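The product-limit computation can be sketched directly from this formula. The hypothetical `kaplan_meier` helper below is written from the definition, not taken from *lifelines*:

```python
import numpy as np

def kaplan_meier(T, E):
    """Product-limit sketch: at each distinct observed death time t_i,
    multiply the running estimate by (n_i - d_i) / n_i."""
    T, E = np.asarray(T), np.asarray(E)
    times = np.sort(np.unique(T[E == 1]))      # distinct observed death times
    S, estimates = 1.0, {}
    for t in times:
        n_i = np.sum(T >= t)                   # at risk just prior to t
        d_i = np.sum((T == t) & (E == 1))      # deaths at t
        S *= (n_i - d_i) / n_i
        estimates[int(t)] = S
    return estimates

# toy data: five subjects, one censored at t=3
print(kaplan_meier([1, 2, 3, 4, 5], [1, 1, 0, 1, 1]))
```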

Let’s bring in our dataset.

```
import pandas as pd
from lifelines.datasets import load_dd
data = load_dd()
```

```
data.sample(6)
# the boolean column `observed` refers to whether the death (leaving office)
# was observed or not.
```

| | ctryname | cowcode2 | politycode | un_region_name | un_continent_name | ehead | leaderspellreg | democracy | regime | start_year | duration | observed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 164 | Bolivia | 145 | 145.0 | South America | Americas | Rene Barrientos Ortuno | Rene Barrientos Ortuno.Bolivia.1966.1968.Milit... | Non-democracy | Military Dict | 1966 | 3 | 0 |
| 740 | India | 750 | 750.0 | Southern Asia | Asia | Chandra Shekhar | Chandra Shekhar.India.1990.1990.Parliamentary Dem | Democracy | Parliamentary Dem | 1990 | 1 | 1 |
| 220 | Bulgaria | 355 | 355.0 | Eastern Europe | Europe | Todor Zhivkov | Todor Zhivkov.Bulgaria.1954.1988.Civilian Dict | Non-democracy | Civilian Dict | 1954 | 35 | 1 |
| 772 | Ireland | 205 | 205.0 | Northern Europe | Europe | Charles Haughey | Charles Haughey.Ireland.1979.1980.Mixed Dem | Democracy | Mixed Dem | 1979 | 2 | 1 |
| 1718 | United States of America | 2 | 2.0 | Northern America | Americas | Gerald Ford | Gerald Ford.United States of America.1974.1976... | Democracy | Presidential Dem | 1974 | 3 | 1 |
| 712 | Iceland | 395 | 395.0 | Northern Europe | Europe | Stefan Stefansson | Stefan Stefansson.Iceland.1947.1948.Mixed Dem | Democracy | Mixed Dem | 1947 | 2 | 1 |

6 rows × 12 columns

From the `lifelines` library, we’ll need the `KaplanMeierFitter` for this exercise:

```
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
```

Note

Other ways to estimate the survival function in *lifelines* are `BreslowFlemingHarringtonFitter`, `WeibullFitter`, and `ExponentialFitter`.

For this estimation, we need the duration each leader was/has been in office, and whether or not they were observed to have left office (leaders who died in office, or who were still in office in 2008, the latest date this data was recorded, do not have observed death events).

We next use the `KaplanMeierFitter` method `fit` to fit the model to
the data. (This is similar to, and inspired by, scikit-learn’s
fit/predict API.)

```
KaplanMeierFitter.fit(durations, event_observed=None,
    timeline=None, entry=None, label='KM_estimate',
    alpha=None, left_censorship=False, ci_labels=None)

Parameters:
  durations: an array, or pd.Series, of length n -- duration subject was observed for
  timeline: return the best estimate at the values in timelines (positively increasing)
  event_observed: an array, or pd.Series, of length n -- True if the death was observed, False if the event
      was lost (right-censored). Defaults all True if event_observed==None
  entry: an array, or pd.Series, of length n -- relative time when a subject entered the study. This is
      useful for left-truncated (not left-censored) observations. If None, all members of the population
      were born at time 0.
  label: a string to name the column of the estimate.
  alpha: the alpha value in the confidence intervals. Overrides the initializing
      alpha for this call to fit only.
  left_censorship: True if durations and event_observed refer to left-censorship events. Default False
  ci_labels: add custom column names to the generated confidence intervals
      as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>

Returns:
  a modified self, with new properties like 'survival_function_'.
```

Below we fit our data with the `KaplanMeierFitter`

:

```
T = data["duration"]
E = data["observed"]
kmf.fit(T, event_observed=E)
```

```
<lifelines.KaplanMeierFitter: fitted with 1808 observations, 340 censored>
```

After calling the `fit` method, the `KaplanMeierFitter` has a property
called `survival_function_`. (Again, we follow the styling of
scikit-learn, and append an underscore to all properties that were computationally estimated.)
The property is a Pandas DataFrame, so we can call `plot` on it:

```
kmf.survival_function_.plot()
plt.title('Survival function of political regimes');
```

How do we interpret this? The y-axis represents the probability a leader is still
around after \(t\) years, where \(t\) years is on the x-axis. We
see that very few leaders make it past 20 years in office. Of course,
like all good stats, we need to report how uncertain we are about these
point estimates, i.e., we need confidence intervals. They are computed in
the call to `fit`, and located under the `confidence_interval_` property.
(The method uses the exponential Greenwood confidence interval; the mathematics are found in these notes.)

Alternatively, we can call `plot` on the `KaplanMeierFitter` itself
to plot both the KM estimate and its confidence intervals:

```
kmf.plot()
```

Note

Don’t like the shaded area for confidence intervals? See below for examples on how to change this.

The median time in office, which defines the point in time where on average 1/2 of the population has expired, is a property:

```
kmf.median_
# 4
```

Interesting that it is only four years. That means, around the world, elected leaders have a 50% chance of cessation within four years!
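The median can be sketched directly from any survival function: it is the first time the survival probability drops to 0.5 or below. The step function below is made-up data for illustration, and `median_survival` is a hypothetical helper, not the *lifelines* implementation:

```python
import pandas as pd

def median_survival(sf):
    """First time t where S(t) <= 0.5; a sketch of what `kmf.median_` reports."""
    below = sf[sf <= 0.5]
    return below.index[0] if len(below) else float("inf")

# hypothetical step survival function, indexed by time
sf = pd.Series([1.0, 0.8, 0.55, 0.45, 0.2], index=[0, 1, 2, 4, 7])
print(median_survival(sf))  # -> 4
```

If the estimated survival curve never reaches 0.5 (heavy censoring), the median is undefined, hence the infinity fallback.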

Let’s segment on democratic regimes vs non-democratic regimes. Calling
`plot` on either the estimate itself or the fitter object will return
an `axis` object, which can be used for plotting further estimates:

```
ax = plt.subplot(111)
dem = (data["democracy"] == "Democracy")
kmf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
kmf.plot(ax=ax, ci_force_lines=True)
kmf.fit(T[~dem], event_observed=E[~dem], label="Non-democratic Regimes")
kmf.plot(ax=ax, ci_force_lines=True)
plt.ylim(0, 1);
plt.title("Lifespans of different global regimes");
```

We might be interested in estimating the probabilities in between some
points. We can do that with the `timeline` argument. We specify the
times we are interested in and are returned a DataFrame with the
probabilities of survival at those points:

```
ax = plt.subplot(111)
t = np.linspace(0, 50, 51)
kmf.fit(T[dem], event_observed=E[dem], timeline=t, label="Democratic Regimes")
ax = kmf.plot(ax=ax)
print("Median survival time of democratic:", kmf.median_)
kmf.fit(T[~dem], event_observed=E[~dem], timeline=t, label="Non-democratic Regimes")
ax = kmf.plot(ax=ax)
print("Median survival time of non-democratic:", kmf.median_)
plt.ylim(0,1)
plt.title("Lifespans of different global regimes");
```

```
Median survival time of democratic: Democratic Regimes 3
dtype: float64
Median survival time of non-democratic: Non-democratic Regimes 6
dtype: float64
```

It is incredible how much longer non-democratic regimes exist for. A democratic regime does have a natural bias towards death, though: both via elections and natural limits (the US imposes a strict eight-year limit). The median of a non-democratic regime is only about twice as large as that of a democratic regime, but the difference is apparent in the tails: if you’re a non-democratic leader and you’ve made it past the 10 year mark, you probably have a long life ahead. Meanwhile, a democratic leader rarely makes it past ten years, and has a very short lifetime past that.

Here the difference between survival functions is very obvious, and
performing a statistical test seems pedantic. If the curves are more
similar, or we possess less data, we may be interested in performing a
statistical test. In this case, *lifelines* contains routines in
`lifelines.statistics` to compare two survival curves. Below we
demonstrate this routine. The function `logrank_test` is a common
statistical test in survival analysis that compares two event series’
generators. If the value returned exceeds some pre-specified value, then
we rule that the series have different generators.

```
from lifelines.statistics import logrank_test
results = logrank_test(T[dem], T[~dem], E[dem], E[~dem], alpha=.99)
results.print_summary()
```

```
Results
  df: 1
  alpha: 0.99
  t 0: -1
  test: logrank
  null distribution: chi squared

  __ p-value ___|__ test statistic __|____ test results ____|__ significant __
        0.00000 |            208.306 |          Reject Null |            True
```

Let’s compare the different *types* of regimes present in the dataset:

```
regime_types = data['regime'].unique()

for i, regime_type in enumerate(regime_types):
    ax = plt.subplot(2, 3, i + 1)
    ix = data['regime'] == regime_type
    kmf.fit(T[ix], E[ix], label=regime_type)
    kmf.plot(ax=ax, legend=False)
    plt.title(regime_type)
    plt.xlim(0, 50)
    if i == 0:
        plt.ylabel('Frac. in power after $n$ years')

plt.tight_layout()
```

##### Getting data into the right format

The *lifelines* data format is consistent across all estimator classes and
functions: an array of individual durations, and the individuals’
event observations (if any). These are often denoted `T` and `E`
respectively. For example:

```
T = [0,3,3,2,1,2]
E = [1,1,0,0,1,1]
kmf.fit(T, event_observed=E)
```

The raw data is not always available in this format – *lifelines*
includes some helper functions to transform data formats to the *lifelines*
format. These are located in the `lifelines.utils` sublibrary. For
example, the function `datetimes_to_durations` accepts an array or
Pandas object of start times/dates, and an array or Pandas object of
end times/dates (or `None` if not observed):

```
from lifelines.utils import datetimes_to_durations
start_date = ['2013-10-10 0:00:00', '2013-10-09', '2013-10-10']
end_date = ['2013-10-13', '2013-10-10', None]
T, E = datetimes_to_durations(start_date, end_date, fill_date='2013-10-15')
print('T (durations): ', T)
print('E (event_observed): ', E)
```

```
T (durations): [ 3. 1. 5.]
E (event_observed): [ True True False]
```

The function `datetimes_to_durations` is very flexible, and has many
keywords to tinker with.
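For intuition, the core of the transformation can be sketched with plain pandas. This mirrors the example above, with `fill` playing the role of `fill_date`; it is an illustration, not the library’s implementation:

```python
import pandas as pd

start = pd.to_datetime(['2013-10-10', '2013-10-09', '2013-10-10'])
end = pd.to_datetime(pd.Series(['2013-10-13', '2013-10-10', None]))

fill = pd.Timestamp('2013-10-15')   # stand-in for fill_date: the analysis date
E = end.notna().to_numpy()          # event observed iff an end date exists
T = (end.fillna(fill) - start).dt.days.to_numpy()

print(T, E)  # durations in days, and the event flags
```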

##### Fitting to a Weibull model

Another very popular model for survival data is the Weibull model. In contrast to the Kaplan-Meier estimator, this model is a *parametric model*, meaning it has a functional form with parameters that we fit to the data. (The Kaplan-Meier estimator has no parameters to fit.) Mathematically, the survival function looks like:

\[S(t) = \exp\left(-(\lambda t)^\rho\right), \quad \lambda > 0, \rho > 0\]

A priori, we do not know what \(\lambda\) and \(\rho\) are, but we use the data on hand to estimate these parameters. In *lifelines*, this is implemented in the `WeibullFitter`:

```
from lifelines import WeibullFitter
T = data['duration']
E = data['observed']
wf = WeibullFitter()
wf.fit(T, E)
print(wf.lambda_, wf.rho_)
wf.print_summary()
```

#### Estimating hazard rates using Nelson-Aalen

The survival curve is a great way to summarize and visualize the
lifetime data, however it is not the only way. If we are curious about the hazard function \(\lambda(t)\) of a
population, we unfortunately cannot transform the Kaplan-Meier estimate
– statistics doesn’t work quite that well. Fortunately, there is a
proper estimator of the *cumulative* hazard function:

\[\Lambda(t) = \int_0^t \lambda(s)\, ds\]

The estimator for this quantity is called the Nelson-Aalen estimator:

\[\hat{\Lambda}(t) = \sum_{t_i \le t} \frac{d_i}{n_i}\]

where \(d_i\) is the number of deaths at time \(t_i\) and \(n_i\) is the number of susceptible individuals.
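The sum is easy to compute by hand. The hypothetical `nelson_aalen` helper below is written from the formula (not taken from *lifelines*), on the same flavor of toy data as earlier:

```python
import numpy as np

def nelson_aalen(T, E):
    """Sum sketch: the cumulative hazard at t is the sum of d_i / n_i over
    observed death times t_i <= t."""
    T, E = np.asarray(T), np.asarray(E)
    times = np.sort(np.unique(T[E == 1]))
    cum, estimates = 0.0, {}
    for t in times:
        n_i = np.sum(T >= t)                 # at risk just prior to t
        d_i = np.sum((T == t) & (E == 1))    # deaths at t
        cum += d_i / n_i
        estimates[int(t)] = cum
    return estimates

# toy data: five subjects, one censored at t=3
print(nelson_aalen([1, 2, 3, 4, 5], [1, 1, 0, 1, 1]))
```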

In *lifelines*, this estimator is available as the `NelsonAalenFitter`. Let’s use the regime dataset from above:

```
T = data["duration"]
E = data["observed"]
from lifelines import NelsonAalenFitter
naf = NelsonAalenFitter()
naf.fit(T, event_observed=E)
```

After fitting, the class exposes the property `cumulative_hazard_` as a DataFrame:

```
print(naf.cumulative_hazard_.head())
naf.plot()
```

```
   NA-estimate
0     0.000000
1     0.325912
2     0.507356
3     0.671251
4     0.869867

[5 rows x 1 columns]
```

The cumulative hazard is less immediately interpretable than the survival
curve, but the hazard curve is the basis of more advanced techniques in
survival analysis. Recall that we are estimating the *cumulative hazard
curve*, \(\Lambda(t)\). (Why? The sum of estimates is much more
stable than the point-wise estimates.) Thus we know the *rate of change*
of this curve is an estimate of the hazard function.
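To make “rate of change” concrete: differencing a cumulative hazard recovers the hazard. A hypothetical constant hazard of 0.3 is assumed below for illustration:

```python
import numpy as np

# Hypothetical cumulative hazard on a grid: constant hazard 0.3 gives
# Lambda(t) = 0.3 t, so every difference quotient should recover ~0.3.
t = np.linspace(0, 10, 101)
cum_hazard = 0.3 * t

hazard_estimate = np.diff(cum_hazard) / np.diff(t)
print(hazard_estimate[:3])
```

With a noisy estimated curve like the Nelson-Aalen output, raw differences are jumpy, which is why the smoothing discussed below is needed.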

Looking at the figure above, it looks like the hazard starts off high and gets smaller (as seen by the decreasing rate of change). Let’s break the regimes down between democratic and non-democratic, during the first 20 years:

Note

We are using the `loc`

argument in the call to `plot`

here: it accepts a `slice`

and plots only points within that slice.

```
naf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
ax = naf.plot(loc=slice(0, 20))
naf.fit(T[~dem], event_observed=E[~dem], label="Non-democratic Regimes")
naf.plot(ax=ax, loc=slice(0, 20))
plt.title("Cumulative hazard function of different global regimes");
```

Looking at the rates of change, I would say that both political
philosophies have a constant hazard, albeit democratic regimes have a
much *higher* constant hazard. So why did the combination of both
regimes have a *decreasing* hazard? This is the effect of *frailty*, a
topic we will discuss later.

##### Smoothing the hazard curve

Interpretation of the cumulative hazard function can be difficult – it is not how we usually interpret functions. (On the other hand, most survival analysis is done using the cumulative hazard function, so understanding it is recommended).

Alternatively, we can derive the more interpretable hazard curve, but
there is a catch. The derivation involves a kernel smoother (to smooth
out the differences of the cumulative hazard curve), and this requires
us to specify a bandwidth parameter that controls the amount of
smoothing. This functionality is in the `smoothed_hazard_` and
`hazard_confidence_intervals_` methods. (Why methods? They require
an argument representing the bandwidth.)
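The idea of the kernel smoother can be sketched directly: weight each Nelson-Aalen increment \(d_i/n_i\) by a kernel centered at the evaluation point. The Epanechnikov kernel and the toy increments below are illustrative assumptions, not *lifelines* internals:

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, a common smoothing choice."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

def smoothed_hazard(event_times, increments, grid, bandwidth):
    """Smooth the Nelson-Aalen increments d_i/n_i with a kernel of width
    `bandwidth`: larger bandwidths give flatter hazard estimates."""
    grid = np.asarray(grid, dtype=float)[:, None]     # shape (len(grid), 1)
    u = (grid - np.asarray(event_times)) / bandwidth
    return (epanechnikov(u) * np.asarray(increments)).sum(axis=1) / bandwidth

# toy increments d_i/n_i at three event times, evaluated at t = 2
h = smoothed_hazard([1.0, 2.0, 3.0], [0.2, 0.25, 0.33], grid=[2.0], bandwidth=3.0)
print(h)
```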

There is also a `plot_hazard` function (that also requires a
`bandwidth` keyword) that will plot the estimate plus the confidence
intervals, similar to the traditional `plot` functionality.

```
b = 3.
naf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
ax = naf.plot_hazard(bandwidth=b)
naf.fit(T[~dem], event_observed=E[~dem], label="Non-democratic Regimes")
naf.plot_hazard(ax=ax, bandwidth=b)
plt.title("Hazard function of different global regimes | bandwidth=%.1f"%b);
plt.ylim(0, 0.4)
plt.xlim(0, 25);
```

It is clearer here which group has the higher hazard, and as hypothesized above, both hazard rates are close to being constant.

There is no obvious way to choose a bandwidth, and different bandwidths produce different inferences, so it’s best to be very careful here. (My advice: stick with the cumulative hazard function.)

```
b = 8.
naf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
ax = naf.plot_hazard(bandwidth=b)
naf.fit(T[~dem], event_observed=E[~dem], label="Non-democratic Regimes")
naf.plot_hazard(ax=ax, bandwidth=b)
plt.title("Hazard function of different global regimes | bandwidth=%.1f"%b);
```

#### Other types of censorship

##### Left Censored Data

We’ve mainly been focusing on *right-censorship*, which describes cases where we do not observe the death event.
This situation is the most common one. Alternatively, there are situations where we do not observe the *birth* event
occurring. Consider the case where a doctor sees a delayed onset of symptoms of an underlying disease. The doctor
is unsure *when* the disease was contracted (birth), but knows it was before the discovery.

Another situation where we have left-censored data is when measurements have only an upper bound, that is, the measurement
instruments could only detect that the measurement was *less* than some upper bound.

*lifelines* has support for left-censored datasets in the `KaplanMeierFitter` class, by adding the keyword `left_censorship=True` (default `False`) to the call to `fit`.

```
from lifelines.datasets import load_lcd
lcd_dataset = load_lcd()
ix = lcd_dataset['group'] == 'alluvial_fan'
T = lcd_dataset[ix]['T']
E = lcd_dataset[ix]['E'] #boolean array, True if observed.
kmf = KaplanMeierFitter()
kmf.fit(T, E, left_censorship=True)
```

Instead of producing a survival function, with left-censored data we are more interested in the cumulative density function
of time to birth. This is available as the `cumulative_density_` property after fitting the data.

```
print(kmf.cumulative_density_)
kmf.plot() #will plot the CDF
```

##### Left Truncated Data

Another form of bias that is introduced into a dataset is called left-truncation (also a form of censorship).
Left-truncation occurs when individuals may die even before ever entering the study. Both `KaplanMeierFitter`
and `NelsonAalenFitter` have an optional argument for `entry`, which is an array of equal size to the duration array.
It describes the offset from birth to entering the study. This is also useful when subjects enter the study at different
points in their lifetime. For example, if you are measuring the time to death of prisoners in
prison, the prisoners will enter the study at different ages.
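The effect of `entry` can be illustrated by counting the at-risk set by hand: late entrants are simply not at risk until they enter. The helper and numbers below are hypothetical, for illustration only:

```python
import numpy as np

def at_risk(entry, duration, t):
    """Subjects at risk at time t: already entered the study and not yet exited.
    Exit time here is entry + duration; with late entry, n_i is smaller early on."""
    entry, duration = np.asarray(entry), np.asarray(duration)
    return int(np.sum((entry <= t) & (entry + duration >= t)))

entry = [0, 0, 3, 4]      # hypothetical offsets from birth to study entry
duration = [5, 6, 5, 5]   # time from study entry to exit

print(at_risk(entry, duration, 2))  # -> 2: the two late entrants are not yet at risk
print(at_risk(entry, duration, 5))  # -> 4: everyone has entered, no one has exited
```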

Note

Nothing changes in the duration array: it still measures the time from entry into the study to the time the subject left the study (either by death or censorship).

Note

Other types of censorship, like interval-censorship, are not implemented in

lifelinesyet.

### Survival Regression

Often we have additional data aside from the duration and, if applicable, any censorship that occurred. In the regime dataset, we have the type of government the political leader was part of, the country they were head of, and the year they were elected. Can we use this data in survival analysis?

Yes, the technique is called *survival regression* – the name implies
we regress covariates (e.g., year elected, country, etc.) against
another variable – in this case durations and lifetimes. Similar to the
logic in the first part of this tutorial, we cannot use traditional
methods like linear regression.

There are two popular competing techniques in survival regression: Cox’s model and Aalen’s additive model. Both models attempt to represent the hazard rate \(\lambda(t \mid x)\) as a function of \(t\) and some covariates \(x\). In Cox’s model, the relationship is defined:

\[\lambda(t \mid x) = b_0(t) \exp\left(b_1 x_1 + \cdots + b_n x_n\right)\]

On the other hand, Aalen’s additive model assumes the following form:

\[\lambda(t \mid x) = b_0(t) + b_1(t) x_1 + \cdots + b_n(t) x_n\]

#### Cox’s Proportional Hazard model

*lifelines* has an implementation of the Cox proportional hazards regression model (implemented in
R under `coxph`). The idea behind the model is that the log-hazard of an individual is a linear function of their static covariates *and* a population-level baseline hazard that changes over time. Mathematically:

\(\lambda(t | x) = b_0(t) \exp\left(\sum_{i=1}^n b_i x_i\right)\)

Note a few facts about this model: the only time component is in the baseline hazard, \(b_0(t)\). In the above product, the second term is a scalar factor that merely scales the baseline hazard up or down. Thus a change in a covariate can only increase or decrease the baseline hazard.
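To make the multiplicative effect concrete, here is a small numeric sketch (the coefficient value is made up for illustration):

```python
import numpy as np

# suppose a fitted coefficient b = 0.09 for some covariate
b = 0.09

# a one-unit increase in the covariate multiplies the hazard by exp(b)
# at every time t -- the baseline hazard b_0(t) itself is unchanged
hazard_ratio = np.exp(b)              # about a 9.4% higher hazard
hazard_ratio_3_units = np.exp(3 * b)  # effects compound multiplicatively
```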

##### Lifelines implementation¶

The implementation of the Cox model in *lifelines*, called `CoxPHFitter`, has a similar API to `AalenAdditiveFitter`. Like R, it has a `print_summary` function that prints a tabular view of coefficients and related stats.

This example data is from the Rossi et al. recidivism study, available as `load_rossi` in *lifelines*.

```
from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter
rossi_dataset = load_rossi()
cph = CoxPHFitter()
cph.fit(rossi_dataset, duration_col='week', event_col='arrest')
cph.print_summary() # access the results using cph.summary
"""
n=432, number of events=114
coef exp(coef) se(coef) z p lower 0.95 upper 0.95
fin -0.3790 0.6845 0.1914 -1.9806 0.0476 -0.7542 -0.0039 *
age -0.0572 0.9444 0.0220 -2.6042 0.0092 -0.1003 -0.0142 **
race 0.3141 1.3691 0.3080 1.0198 0.3078 -0.2897 0.9180
wexp -0.1511 0.8597 0.2121 -0.7124 0.4762 -0.5670 0.2647
mar -0.4328 0.6487 0.3818 -1.1335 0.2570 -1.1813 0.3157
paro -0.0850 0.9185 0.1957 -0.4341 0.6642 -0.4687 0.2988
prio 0.0911 1.0954 0.0286 3.1824 0.0015 0.0350 0.1472 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Concordance = 0.640
"""
```

To access the coefficients and the baseline hazard directly, you can use `cph.hazards_` and `cph.baseline_hazard_` respectively.

##### Convergence¶

Fitting the Cox model to the data involves using gradient descent. *lifelines* takes extra effort to help with convergence. If you wish to watch the fitting, there is a `show_progress` parameter in the `CoxPHFitter.fit` function. For further help, see Problems with convergence in the Cox Proportional Hazard Model.

After fitting, the value of the maximum log-likelihood is available using `cph._log_likelihood`. Similarly, the score vector and Hessian matrix are available under `_score_` and `_hessian_` respectively. The `_hessian_` can be used to find the covariance matrix of the coefficients.
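As a sketch of that last point (using a made-up 2×2 Hessian, not values from a real fit): inverting the Hessian of the negative log-likelihood gives an approximate covariance matrix, whose diagonal yields the standard errors.

```python
import numpy as np

# hypothetical Hessian of the negative log-likelihood at the optimum
H = np.array([[25.0, 3.0],
              [3.0, 16.0]])

cov = np.linalg.inv(H)              # approximate covariance of the coefficients
std_errors = np.sqrt(np.diag(cov))  # standard errors, as in the summary table
```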

##### Goodness of fit and prediction¶

After fitting, you may want to know how “good” of a fit your model was to the data. Aside from traditional approaches, two methods the author has found useful are 1. looking at the concordance-index (see the section on Model Selection in Survival Regression below), available as `cph.score_` or in the `print_summary`, and 2. comparing the spread between the baseline survival function and the Kaplan-Meier survival function. (Why? A small spread between these two curves means that the exponential term in the Cox model does very little, whereas a large spread means *most* of the changes in individual hazard can be attributed to the exponential term.) For example, the first figure below is a good fit, and the second figure is a much weaker fit.

After fitting, you can use the suite of prediction methods (similar to Aalen’s additive model): `.predict_partial_hazard`, `.predict_survival_function`, etc.

```
X = rossi_dataset.drop(["week", "arrest"], axis=1)
cph.predict_partial_hazard(X)
cph.predict_survival_function(X)
```

##### Plotting the coefficients¶

With a fitted model, an alternative way to view the coefficients and their ranges is to use the `plot` method.

```
from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter
rossi_dataset = load_rossi()
cph = CoxPHFitter()
cph.fit(rossi_dataset, duration_col='week', event_col='arrest')
cph.plot()
```

##### Plotting the effect of varying a covariate¶

After fitting, we can plot what the survival curves look like as we vary a single covariate while
holding everything else equal. This is useful to understand the impact of a covariate, *given the model*. To do this, we use the `plot_covariate_groups` method and give it the covariate of interest and the values to display.

```
from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter
rossi_dataset = load_rossi()
cph = CoxPHFitter()
cph.fit(rossi_dataset, duration_col='week', event_col='arrest')
cph.plot_covariate_groups('prio', [0, 5, 10, 15])
```

##### Checking the proportional hazards assumption¶

A quick and visual way to check the proportional hazards assumption of a variable is to plot the survival curves segmented by the values of the variable. If the survival curves are the same “shape” and differ only by a constant factor, then the assumption holds. A clearer way to see this is to plot the loglogs curve: \(\log(-\log(\text{survival function}))\) vs \(\log(\text{time})\). If the curves are parallel (and hence do not cross each other), then it’s likely the variable satisfies the assumption. If the curves do cross, you’ll likely have to “stratify” the variable (see next section). In *lifelines*, the `KaplanMeierFitter` object has a `.plot_loglogs` function for this purpose.

The following is the loglogs curves of two variables in our regime dataset. The first is the democracy type, which does have (close to) parallel lines, hence satisfies our assumption:

```
from lifelines.datasets import load_dd
from lifelines import KaplanMeierFitter
from matplotlib import pyplot as plt
data = load_dd()
democracy_0 = data.loc[data['democracy'] == 'Non-democracy']
democracy_1 = data.loc[data['democracy'] == 'Democracy']
kmf0 = KaplanMeierFitter()
kmf0.fit(democracy_0['duration'], event_observed=democracy_0['observed'])
kmf1 = KaplanMeierFitter()
kmf1.fit(democracy_1['duration'], event_observed=democracy_1['observed'])
fig, axes = plt.subplots()
kmf0.plot_loglogs(ax=axes)
kmf1.plot_loglogs(ax=axes)
axes.legend(['Non-democracy', 'Democracy'])
plt.show()
```

The second variable is the regime type, and this variable does not follow the proportional hazards assumption.

##### Stratification¶

Sometimes a covariate may not obey the proportional hazard assumption. In this case, we can allow the baseline hazard to vary across the levels of that covariate, without estimating its effect. To specify categorical variables to be used in stratification, we define them in the call to `fit`:

```
from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter
rossi_dataset = load_rossi()
cph = CoxPHFitter()
cph.fit(rossi_dataset, 'week', event_col='arrest', strata=['race'])
cph.print_summary() # access the results using cph.summary
"""
n=432, number of events=114
coef exp(coef) se(coef) z p lower 0.95 upper 0.95
fin -0.3775 0.6856 0.1913 -1.9731 0.0485 -0.7525 -0.0024 *
age -0.0573 0.9443 0.0220 -2.6081 0.0091 -0.1004 -0.0142 **
wexp -0.1435 0.8664 0.2127 -0.6746 0.4999 -0.5603 0.2734
mar -0.4419 0.6428 0.3820 -1.1570 0.2473 -1.1907 0.3068
paro -0.0839 0.9196 0.1958 -0.4283 0.6684 -0.4677 0.3000
prio 0.0919 1.0962 0.0287 3.1985 0.0014 0.0356 0.1482 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Concordance = 0.638
"""
```

#### Aalen’s Additive model¶

Warning

This implementation is still experimental.

The estimator to fit unknown coefficients in Aalen’s additive model is
located in `estimators` under `AalenAdditiveFitter`. For this
exercise, we will use the regime dataset and include the categorical
variables `un_continent_name` (e.g., Asia, North America, …), the
`regime` type (e.g., monarchy, civilian, …) and the year the regime
started in, `start_year`.

Aalen’s additive model typically does not estimate the individual
\(b_i(t)\) but instead estimates \(\int_0^t b_i(s) \; ds\)
(similar to the estimate of the hazard rate using `NelsonAalenFitter`
above). This is important to keep in mind when analyzing the output.

```
from lifelines import AalenAdditiveFitter
from lifelines.datasets import load_dd
data = load_dd()
data.head()
```

| ctryname | cowcode2 | politycode | un_region_name | un_continent_name | ehead | leaderspellreg | democracy | regime | start_year | duration | observed |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 700 | 700 | Southern Asia | Asia | Mohammad Zahir Shah | Mohammad Zahir Shah.Afghanistan.1946.1952.Mona... | Non-democracy | Monarchy | 1946 | 7 | 1 |
| Afghanistan | 700 | 700 | Southern Asia | Asia | Sardar Mohammad Daoud | Sardar Mohammad Daoud.Afghanistan.1953.1962.Ci... | Non-democracy | Civilian Dict | 1953 | 10 | 1 |
| Afghanistan | 700 | 700 | Southern Asia | Asia | Mohammad Zahir Shah | Mohammad Zahir Shah.Afghanistan.1963.1972.Mona... | Non-democracy | Monarchy | 1963 | 10 | 1 |
| Afghanistan | 700 | 700 | Southern Asia | Asia | Sardar Mohammad Daoud | Sardar Mohammad Daoud.Afghanistan.1973.1977.Ci... | Non-democracy | Civilian Dict | 1973 | 5 | 0 |
| Afghanistan | 700 | 700 | Southern Asia | Asia | Nur Mohammad Taraki | Nur Mohammad Taraki.Afghanistan.1978.1978.Civi... | Non-democracy | Civilian Dict | 1978 | 1 | 0 |

5 rows × 12 columns

I’m using the lovely library patsy here to create a design matrix from my original dataframe.

```
import patsy
X = patsy.dmatrix('un_continent_name + regime + start_year', data, return_type='dataframe')
X = X.rename(columns={'Intercept': 'baseline'})
```

```
X.columns.tolist()
```

```
['baseline',
'un_continent_name[T.Americas]',
'un_continent_name[T.Asia]',
'un_continent_name[T.Europe]',
'un_continent_name[T.Oceania]',
'regime[T.Military Dict]',
'regime[T.Mixed Dem]',
'regime[T.Monarchy]',
'regime[T.Parliamentary Dem]',
'regime[T.Presidential Dem]',
'start_year']
```

We have also included the `coef_penalizer` option. During the estimation, a
linear regression is computed at each step. Often the regression can be
unstable (due to high co-linearity or small sample sizes) – adding a penalizer term controls the stability. I recommend always starting with a small penalizer term – if the estimates still appear to be too unstable, try increasing it.

```
aaf = AalenAdditiveFitter(coef_penalizer=1.0)
```

An instance of `AalenAdditiveFitter` includes a `fit` method that performs the inference on the coefficients. This method accepts a pandas DataFrame: each row is an individual, and the columns are the covariates plus
two special columns: a *duration* column and a boolean *event occurred* column (where “event occurred” refers to the event of interest – expulsion from government in this case).

```
X['T'] = data['duration']
X['E'] = data['observed']
```

```
aaf.fit(X, 'T', event_col='E')
```

After fitting, the instance exposes a `cumulative_hazards_` DataFrame
containing the estimates of \(\int_0^t b_i(s) \; ds\):

```
aaf.cumulative_hazards_.head()
```

| un_continent_name[Africa] | un_continent_name[Americas] | un_continent_name[Asia] | un_continent_name[Europe] | un_continent_name[Oceania] | regime[T.Military Dict] | regime[T.Mixed Dem] | regime[T.Monarchy] | regime[T.Parliamentary Dem] | regime[T.Presidential Dem] | start_year | baseline |
|---|---|---|---|---|---|---|---|---|---|---|---|
| -0.051595 | -0.082406 | 0.010666 | 0.154493 | -0.060438 | 0.075333 | 0.086274 | -0.133938 | 0.048077 | 0.127171 | 0.000116 | -0.029280 |
| -0.014713 | -0.039471 | 0.095668 | 0.194251 | -0.092696 | 0.115033 | 0.358702 | -0.226233 | 0.168783 | 0.121862 | 0.000053 | 0.143039 |
| 0.007389 | -0.064758 | 0.115121 | 0.170549 | 0.069371 | 0.161490 | 0.677347 | -0.271183 | 0.328483 | 0.146234 | 0.000004 | 0.297672 |
| -0.058418 | 0.011399 | 0.091784 | 0.205824 | 0.125722 | 0.220028 | 0.932674 | -0.294900 | 0.365604 | 0.422617 | 0.000002 | 0.376311 |
| -0.099282 | 0.106641 | 0.112083 | 0.150708 | 0.091900 | 0.241575 | 1.123860 | -0.391103 | 0.536185 | 0.743913 | 0.000057 | 0.362049 |

5 rows × 12 columns

`AalenAdditiveFitter` also has built-in plotting:

```
aaf.plot(columns=['regime[T.Presidential Dem]', 'baseline', 'un_continent_name[T.Europe]'], iloc=slice(1,15))
```

Regression is most interesting if we use it on data we have not yet seen, i.e., prediction! We can use what we have learned to predict individual hazard rates, survival functions, and median survival time. The dataset we are using is available up until 2008, so let’s use this data to predict the duration of former Canadian Prime Minister Stephen Harper.

```
ix = (data['ctryname'] == 'Canada') & (data['start_year'] == 2006)
harper = X.loc[ix]
print("Harper's unique data point:")
print(harper)
```

```
Harper's unique data point
```

```
array([[ 0., 0., 1., 0., 0., 0., 0., 1.,
0., 0., 2003.]])
```

```
ax = plt.subplot(2,1,1)
aaf.predict_cumulative_hazard(harper).plot(ax=ax)
ax = plt.subplot(2,1,2)
aaf.predict_survival_function(harper).plot(ax=ax);
```

Warning

Because of the nature of the model, estimated survival functions of individuals can increase. This is an expected artifact of Aalen’s additive model.

#### Cox’s Time Varying Proportional Hazard model¶

Warning

This implementation is still experimental.

Often an individual will have a covariate change over time. An example of this is hospital patients who enter the study and, at some future time, may receive a heart transplant. We would like to know the effect of the transplant, but we cannot naively condition on whether they received the transplant. Consider that if patients needed to wait at least 1 year before getting a transplant, then everyone who dies before that year is considered a non-transplant patient, and hence this would overestimate the hazard of not receiving a transplant.

We can incorporate changes over time into our survival analysis by using a modification of the Cox model above. The general mathematical description is:

\(\lambda(t | x(t)) = b_0(t) \exp\left(\sum_{i=1}^n b_i x_i(t)\right)\)

Note the time-varying \(x_i(t)\) to denote that covariates can change over time. This model is implemented in *lifelines* as `CoxTimeVaryingFitter`. The dataset schema required is different than previous models, so we will spend some time describing this.

##### Dataset for time-varying regression¶

Lifelines requires that the dataset be in what is called the *long* format. This looks like one row per state change, including an ID, the left (exclusive) time point, and right (inclusive) time point. For example, the following dataset tracks three unique subjects.

| id | start | stop | group | z | event |
|---|---|---|---|---|---|
| 1 | 0 | 8 | 1 | 0 | False |
| 2 | 0 | 5 | 0 | 0 | False |
| 2 | 5 | 8 | 0 | 1 | True |
| 3 | 0 | 3 | 1 | 0 | False |
| 3 | 3 | 12 | 1 | 1 | True |

5 rows × 6 columns

In the above dataset, `start` and `stop` denote the boundaries, `id` is the unique identifier per subject, and `event` denotes whether the subject died at the end of that period. For example, subject ID 2 had variable `z=0` up to and including the end of time period 5 (we can think of measurements happening at the end of each time period), after which it was set to 1.
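The example dataset above can be written directly as a pandas DataFrame in the long format:

```python
import pandas as pd

# one row per (subject, state) interval, with (start, stop] boundaries
df = pd.DataFrame([
    {'id': 1, 'start': 0, 'stop': 8,  'group': 1, 'z': 0, 'event': False},
    {'id': 2, 'start': 0, 'stop': 5,  'group': 0, 'z': 0, 'event': False},
    {'id': 2, 'start': 5, 'stop': 8,  'group': 0, 'z': 1, 'event': True},
    {'id': 3, 'start': 0, 'stop': 3,  'group': 1, 'z': 0, 'event': False},
    {'id': 3, 'start': 3, 'stop': 12, 'group': 1, 'z': 1, 'event': True},
])
```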

So if this is the desired dataset, it can be built up from smaller datasets. To do this we can use some helper functions provided in *lifelines*.

Typically, data will be in a format that looks like it comes out of a relational database. You may have a “base” table with ids, durations, a censoring flag, and possibly static covariates. Ex:

| id | duration | event | var1 |
|---|---|---|---|
| 1 | 10 | True | 0.1 |
| 2 | 12 | False | 0.5 |

2 rows × 4 columns

You’ll also have a secondary dataset that records future measurements. Example:

| id | time | var2 |
|---|---|---|
| 1 | 0 | 1.4 |
| 1 | 4 | 1.2 |
| 1 | 8 | 1.5 |
| 2 | 0 | 1.6 |

4 rows × 3 columns

where `time` is the duration from the entry event. Here we see subject 1 had a change in their `var2` covariate at the end of time 4 and at the end of time 8. We can use `to_long_format` to transform the base dataset into a long format, and `add_covariate_to_timeline` to fold the covariate dataset into the original dataset.

```
from lifelines.utils import to_long_format
from lifelines.utils import add_covariate_to_timeline
base_df = to_long_format(base_df, duration_col="T")
df = add_covariate_to_timeline(base_df, cv, duration_col="time", id_col="id", event_col="event")
```

| id | start | stop | var1 | var2 | event |
|---|---|---|---|---|---|
| 1 | 0 | 4 | 0.1 | 1.4 | False |
| 1 | 4 | 8 | 0.1 | 1.2 | False |
| 1 | 8 | 10 | 0.1 | 1.5 | True |
| 2 | 0 | 12 | 0.5 | 1.6 | False |

4 rows × 6 columns

From the above output, we can see that subject 1 changed state twice over the observation period, finally expiring at the end of time 10. Subject 2 was a censored case, and we lost track of them after time 12.

You may have multiple covariates you wish to add, so the above could be streamlined like so:

```
from lifelines.utils import to_long_format
from lifelines.utils import add_covariate_to_timeline
base_df = to_long_format(base_df, duration_col="T")
df = base_df.pipe(add_covariate_to_timeline, cv1, duration_col="time", id_col="id", event_col="event")\
.pipe(add_covariate_to_timeline, cv2, duration_col="time", id_col="id", event_col="event")\
.pipe(add_covariate_to_timeline, cv3, duration_col="time", id_col="id", event_col="event")
```

One additional flag of interest on `add_covariate_to_timeline` is the `cumulative_sum` flag. By default it is False, but setting it to True will perform a cumulative sum on the covariate before joining. This is useful if the covariates describe an incremental change, instead of a state update. For example, we may have measurements of drugs administered to a patient, and we want the covariate to reflect how much has been administered since the start. In contrast, a covariate that measures the temperature of the patient is a state update. See Example cumulative total using `add_covariate_to_timeline` for an example of this.

For an example of pulling datasets like this from a SQL-store, and other helper functions, see Example SQL queries and transformations to get time varying data.

##### Fitting the model and a short note on prediction¶

Once your dataset is in the correct orientation, we can use `CoxTimeVaryingFitter` to fit the model to your data.

```
from lifelines import CoxTimeVaryingFitter
ctv = CoxTimeVaryingFitter()
ctv.fit(df, id_col="id", event_col="event", start_col="start", stop_col="stop")
ctv.print_summary()
ctv.plot()
```

Unlike the other regression models, prediction in a time-varying setting is not normally possible. To predict, we would need to know the covariate values beyond the current time, but if we knew those, we would also know whether the subject was still alive. For this reason, there are no prediction methods attached to `CoxTimeVaryingFitter`.

#### Model Selection in Survival Regression¶

If censoring is present, it’s not appropriate to use a loss function like mean-squared-error or mean-absolute-loss. Instead, one measure is the concordance-index, also known as the c-index. This measure evaluates the accuracy of the ordering of predicted times. It is in fact a generalization of AUC, another common loss function, and is interpreted similarly:

- 0.5 is the expected result from random predictions,
- 1.0 is perfect concordance and,
- 0.0 is perfect anti-concordance (multiply predictions with -1 to get 1.0)

The measure is implemented in lifelines under lifelines.utils.concordance_index and accepts the actual times (along with any censorships) and the predicted times.
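The idea behind the measure can be sketched in a few lines of plain Python (this toy version ignores censoring and ties; lifelines’ `concordance_index` handles both properly):

```python
from itertools import combinations

def toy_c_index(actual, predicted):
    """Fraction of pairs whose predicted ordering matches the actual ordering."""
    pairs = list(combinations(range(len(actual)), 2))
    concordant = sum(
        (actual[i] < actual[j]) == (predicted[i] < predicted[j])
        for i, j in pairs
    )
    return concordant / len(pairs)

toy_c_index([1, 2, 3, 4], [10, 20, 30, 40])  # perfectly concordant: 1.0
toy_c_index([1, 2, 3, 4], [40, 30, 20, 10])  # perfectly anti-concordant: 0.0
```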

##### Cross Validation¶

*lifelines* has an implementation of k-fold cross validation under lifelines.utils.k_fold_cross_validation. This function accepts an instance of a regression fitter (either `CoxPHFitter` or `AalenAdditiveFitter`), a dataset, plus k (the number of folds to perform, default 5). On each fold, it splits the data
into a training set and a testing set, fits itself on the training set, and evaluates itself on the testing set (using the concordance measure).

```
import numpy as np
from lifelines import CoxPHFitter
from lifelines.datasets import load_regression_dataset
from lifelines.utils import k_fold_cross_validation
regression_dataset = load_regression_dataset()
cph = CoxPHFitter()
scores = k_fold_cross_validation(cph, regression_dataset, 'T', event_col='E', k=3)
print(scores)
print(np.mean(scores))
print(np.std(scores))
#[ 0.5896 0.5358 0.5028]
# 0.542
# 0.035
```

### More Examples and Recipes¶

This section goes through some examples and recipes to help you use lifelines.

#### Statistically compare two populations¶

Often researchers want to compare survival curves between different populations (though the techniques below apply just as well to Nelson-Aalen estimates). Here are some techniques to do that:

##### Subtract the difference between survival curves¶

If you are interested in taking the difference between two survival curves, simply trying to
subtract the `survival_function_`s will likely fail if the DataFrames’ indexes are not equal. Fortunately,
the `KaplanMeierFitter` and `NelsonAalenFitter` have a built-in `subtract` method:

```
kmf1.subtract(kmf2)
```

will produce the difference at every relevant time point. A similar function exists for division: `divide`.

##### Compare using a hypothesis test¶

For rigorous testing of differences, *lifelines* comes with a statistics library. The `logrank_test` function
compares whether the “death” generation process of the two populations is equal:

```
from lifelines.statistics import logrank_test
results = logrank_test(T1, T2, event_observed_A=E1, event_observed_B=E2)
results.print_summary()
"""
df=1, alpha=0.95, t0=-1, test=logrank, null distribution=chi squared
test_statistic p
3.528 0.00034 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
"""
print(results.p_value) # 0.46759
print(results.test_statistic) # 0.528
print(results.is_significant) # False
```

If you have more than two populations, you can use `pairwise_logrank_test` (which compares
each pair in the same manner as above), or `multivariate_logrank_test` (which tests the
hypothesis that all the populations have the same “death” generation process).

#### Model selection using lifelines¶

If using *lifelines* for prediction work, it’s ideal that you perform some type of cross-validation scheme. This cross-validation allows you to be confident that your out-of-sample predictions will work well in practice. It also allows you to choose between multiple models.

*lifelines* has a built-in k-fold cross-validation function. For example, consider the following example:

```
import numpy as np
from lifelines import AalenAdditiveFitter, CoxPHFitter
from lifelines.datasets import load_regression_dataset
from lifelines.utils import k_fold_cross_validation
df = load_regression_dataset()
#create the three models we'd like to compare.
aaf_1 = AalenAdditiveFitter(coef_penalizer=0.5)
aaf_2 = AalenAdditiveFitter(coef_penalizer=10)
cph = CoxPHFitter()
print(np.mean(k_fold_cross_validation(cph, df, duration_col='T', event_col='E')))
print(np.mean(k_fold_cross_validation(aaf_1, df, duration_col='T', event_col='E')))
print(np.mean(k_fold_cross_validation(aaf_2, df, duration_col='T', event_col='E')))
```

From these results, Aalen’s additive model with a penalizer of 10 is the best model for predicting future survival times.

#### Displaying at-risk counts below plots¶

The function `add_at_risk_counts` in `lifelines.plotting` allows you to add at-risk counts at the bottom of your figures. For example:

```
from matplotlib import pyplot as plt
from numpy.random import exponential
T_control = exponential(10, size=250)
T_experiment = exponential(20, size=200)
ax = plt.subplot(111)
from lifelines import KaplanMeierFitter
kmf_control = KaplanMeierFitter()
ax = kmf_control.fit(T_control, label='control').plot(ax=ax)
kmf_exp = KaplanMeierFitter()
ax = kmf_exp.fit(T_experiment, label='experiment').plot(ax=ax)
from lifelines.plotting import add_at_risk_counts
add_at_risk_counts(kmf_exp, kmf_control, ax=ax)
```

will display

Alternatively, you can add this at the call to `plot`: `kmf.plot(at_risk_counts=True)`.

#### Transforming survival-table data into lifelines format¶

Some lifelines classes are designed for lists or arrays that represent one individual per row. If you instead have data in a *survival table* format, there exists a utility method to get it into lifelines format.

**Example:** Suppose you have a csv file with data that looks like this:

| time (months, days, …) | observed deaths | censored |
|---|---|---|
| 0 | 7 | 0 |
| 1 | 1 | 1 |
| 2 | 2 | 0 |
| 3 | 1 | 2 |
| 4 | 5 | 2 |
| … | … | … |

```
import pandas as pd
from lifelines.utils import survival_events_from_table
df = pd.read_csv('file.csv', usecols=['observed deaths', 'censored'])
T, E = survival_events_from_table(df, observed_deaths_col='observed deaths', censored_col='censored')
print(T) # array([0,0,0,0,0,0,0,1,...])
print(E) # array([1,1,1,1,1,1,1,0,...])
```

#### Transforming observational data into survival-table format¶

Perhaps you are interested in viewing the survival table given some durations and censorship vectors.

```
from lifelines.utils import survival_table_from_events
table = survival_table_from_events(T, E)
print(table.head())
"""
removed observed censored entrance at_risk
event_at
0 0 0 0 60 60
2 2 1 1 0 60
3 3 1 2 0 58
4 5 3 2 0 55
5 12 6 6 0 50
"""
```

#### Plotting multiple figures on a plot¶

When `plot` is called, an axis object is returned which can be passed into future calls of `plot`:

```
kmf.fit(data1)
ax = kmf.plot()
kmf.fit(data2)
ax = kmf.plot(ax=ax)
```

If you have a pandas DataFrame with columns “group”, “T”, and “E”, then something like the following would work:

```
from lifelines import KaplanMeierFitter
from matplotlib import pyplot as plt
ax = plt.subplot(111)
kmf = KaplanMeierFitter()
for group, data in df.groupby('group'):
    kmf.fit(data["T"], data["E"], label=group)
    kmf.plot(ax=ax)
```

#### Plotting options and styles¶

#### Set the index/timeline of a estimate¶

Suppose your dataset has lifetimes grouped near time 60; thus, after fitting `KaplanMeierFitter`, your survival function might look something like:

```
print(kmf.survival_function_)
KM-estimate
0 1.00
47 0.99
49 0.97
50 0.96
51 0.95
52 0.91
53 0.86
54 0.84
55 0.79
56 0.74
57 0.71
58 0.67
59 0.58
60 0.49
61 0.41
62 0.31
63 0.24
64 0.19
65 0.14
66 0.10
68 0.07
69 0.04
70 0.02
71 0.01
74 0.00
```

What you would like is to have a predictable and full index from 40 to 75. (Notice that in the above index, the last two time points are not adjacent – the cause is that no lifetimes ended at times 72 or 73.) This is especially useful for comparing multiple survival functions at specific time points. To do this, all fitter methods accept a `timeline` argument:

```
kmf.fit(T, timeline=range(40,75))
print(kmf.survival_function_)
KM-estimate
40 1.00
41 1.00
42 1.00
43 1.00
44 1.00
45 1.00
46 1.00
47 0.99
48 0.99
49 0.97
50 0.96
51 0.95
52 0.91
53 0.86
54 0.84
55 0.79
56 0.74
57 0.71
58 0.67
59 0.58
60 0.49
61 0.41
62 0.31
63 0.24
64 0.19
65 0.14
66 0.10
67 0.10
68 0.07
69 0.04
70 0.02
71 0.01
72 0.01
73 0.01
74 0.00
```

Lifelines will intelligently forward-fill the estimates to unseen time points.
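Conceptually, this forward-filling is similar to reindexing a pandas Series onto a denser index (a rough sketch of the behaviour, not lifelines’ actual implementation):

```python
import pandas as pd

# a sparse estimate, as in the first printout above
sf = pd.Series([1.00, 0.99, 0.97], index=[0, 47, 49], name='KM-estimate')

# reindex onto the union of old and new time points, forward-fill,
# then keep only the requested timeline
timeline = range(40, 50)
full = sf.reindex(sf.index.union(pd.Index(timeline))).ffill().reindex(timeline)
```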

#### Example SQL query to get survival data from a table¶

Below is a way to get an example dataset from a relational database (this may vary depending on your database):

```
SELECT
id,
DATEDIFF('dd', started_at, COALESCE(ended_at, CURRENT_DATE)) AS "T",
(ended_at IS NOT NULL) AS "E"
FROM table
```

##### Explanation¶

Each row is an id, a duration, and a boolean indicating whether the event occurred. Recall that we denote
“True” if the event *did* occur, that is, `ended_at` is filled in (we observed the `ended_at`). Ex:

| id | T | E |
|---|---|---|
| 10 | 40 | True |
| 11 | 42 | False |
| 12 | 42 | False |
| 13 | 36 | True |
| 14 | 33 | True |

#### Example SQL queries and transformations to get time varying data¶

For Cox time-varying models, we discussed what the dataset should look like in Dataset for time-varying regression. Typically we have a base dataset, and then we fold in the covariate datasets. Below are some SQL queries and Python transformations from end-to-end.

##### Base dataset: `base_df`¶

```
SELECT
id,
group,
DATEDIFF('dd', dt.started_at, COALESCE(dt.ended_at, CURRENT_DATE)) AS "T",
(ended_at IS NOT NULL) AS "E"
FROM dimension_table dt
```

##### Time-varying variables¶

```
-- this could produce more than 1 row per subject
SELECT
id,
DATEDIFF('dd', dt.started_at, ft.event_at) AS "time",
ft.var1
FROM fact_table ft
JOIN dimension_table dt
USING(id)
```

```
from lifelines.utils import to_long_format
from lifelines.utils import add_covariate_to_timeline
base_df = to_long_format(base_df, duration_col="T")
df = add_covariate_to_timeline(base_df, cv, duration_col="time", id_col="id", event_col="E")
```

##### Event variables¶

Another very common operation is to fold in event data: for example, a dataset that contains information about the dates of an event (and NULLs if the event didn’t occur):

```
SELECT
id,
DATEDIFF('dd', dt.started_at, ft.event1_at) AS "E1",
DATEDIFF('dd', dt.started_at, ft.event2_at) AS "E2",
DATEDIFF('dd', dt.started_at, ft.event3_at) AS "E3"
...
FROM dimension_table dt
```

In Pandas, this may look like:

```
id E1 E2 E3
0 1 1.0 NaN 2.0
1 2 NaN 5.0 NaN
2 3 3.0 5.0 7.0
...
```

Initially, this can’t be added to our baseline DataFrame. Using `utils.covariates_from_event_matrix` we can convert a dataframe like this into one that can be easily added.

```
from lifelines.utils import covariates_from_event_matrix
cv = covariates_from_event_matrix(df, 'id')
df = add_covariate_to_timeline(base_df, cv, duration_col="time", id_col="id", event_col="E", cumulative_sum=True)
```

#### Example cumulative total using `add_covariate_to_timeline`¶

Often we have either __transactional covariate datasets__ or __state covariate datasets__. In a transactional dataset, it may make sense to sum up the covariates to represent administration of a treatment over time. For example, in the risky world of start-ups, we may want to sum up the funding amount received at a certain time. We also may be interested in the amount of the last round of funding. Below is an example to do just that:

Suppose we have an initial DataFrame of start-ups like:

```
seed_df = pd.DataFrame.from_records([
{'id': 'FB', 'E': True, 'T': 12, 'funding': 0},
{'id': 'SU', 'E': True, 'T': 10, 'funding': 0},
])
```

And a covariate dataframe representing funding rounds like:

```
cv = pd.DataFrame.from_records([
{'id': 'FB', 'funding': 30, 't': 5},
{'id': 'FB', 'funding': 15, 't': 10},
{'id': 'FB', 'funding': 50, 't': 15},
{'id': 'SU', 'funding': 10, 't': 6},
{'id': 'SU', 'funding': 9, 't': 10},
])
```

We can do the following to get both the cumulative funding received and the latest round of funding:

```
from lifelines.utils import to_long_format
from lifelines.utils import add_covariate_to_timeline
df = seed_df.pipe(to_long_format, 'T')\
.pipe(add_covariate_to_timeline, cv, 'id', 't', 'E', cumulative_sum=True)\
.pipe(add_covariate_to_timeline, cv, 'id', 't', 'E', cumulative_sum=False)
"""
start cumsum_funding funding stop id E
0 0 0.0 0.0 5.0 FB False
1 5 30.0 30.0 10.0 FB False
2 10 45.0 15.0 12.0 FB True
3 0 0.0 0.0 6.0 SU False
4 6 10.0 10.0 10.0 SU False
5 10 19.0 9.0 10.0 SU True
"""
```

#### Sample size determination under a CoxPH model¶

Suppose you wish to measure the hazard ratio between two populations under the CoxPH model. That is, we want to evaluate the hypothesis
H0: relative hazard ratio = 1 vs H1: relative hazard ratio != 1, where the relative hazard ratio is \(\exp{\left(\beta\right)}\) for the experiment group vs the control group. A priori, we are interested in the sample sizes of the two groups necessary to achieve a certain statistical power. To do this in *lifelines*, there is the `lifelines.statistics.sample_size_necessary_under_cph` function. For example:

```
from lifelines.statistics import sample_size_necessary_under_cph
desired_power = 0.8
ratio_of_participants = 1.
p_exp = 0.25
p_con = 0.35
postulated_hazard_ratio = 0.7
n_exp, n_con = sample_size_necessary_under_cph(desired_power, ratio_of_participants, p_exp, p_con, postulated_hazard_ratio)
# (421, 421)
```

This assumes you have estimates of the probability of an event occurring for both the experiment and control groups. These could be determined from previous experiments.
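For intuition, the arithmetic behind such a sample-size calculation can be sketched with a standard two-group formula for the CoxPH model (stdlib only; this is an illustrative sketch under the usual normal-approximation assumptions, not lifelines' implementation, though it reproduces the numbers above):

```
from math import ceil
from statistics import NormalDist

def sample_size_sketch(power, ratio, p_exp, p_con, hazard_ratio, alpha=0.05):
    z = NormalDist().inv_cdf
    # required number of events for the two-sided test at level alpha
    m = (1.0 / ratio
         * ((ratio * hazard_ratio + 1.0) / (hazard_ratio - 1.0)) ** 2
         * (z(1 - alpha / 2.0) + z(power)) ** 2)
    # inflate by the per-group probabilities of observing an event
    n_exp = m * ratio / (ratio * p_exp + p_con)
    n_con = m / (ratio * p_exp + p_con)
    return ceil(n_exp), ceil(n_con)

sample_size_sketch(0.8, 1.0, 0.25, 0.35, 0.7)  # → (421, 421)
```

Note that a postulated hazard ratio further from 1 requires far fewer participants, since the effect is easier to detect.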

#### Power determination under a CoxPH model¶

Suppose you wish to measure the hazard ratio between two populations under the CoxPH model. To determine the statistical power of a hazard ratio hypothesis test under the CoxPH model, we can use `lifelines.statistics.power_under_cph`. That is, suppose we want to know the probability that we reject the null hypothesis that the relative hazard ratio is 1, given that the relative hazard ratio is truly different from 1. This function will give you that probability.

```
from lifelines.statistics import power_under_cph
n_exp = 50
n_con = 100
p_exp = 0.25
p_con = 0.35
postulated_hazard_ratio = 0.5
power = power_under_cph(n_exp, n_con, p_exp, p_con, postulated_hazard_ratio)
# 0.4957
```
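This is the inverse of the sample-size question, and the same normal-approximation arithmetic applies. Below is an illustrative stdlib-only sketch (not lifelines' implementation, though it agrees with the value above):

```
from math import sqrt
from statistics import NormalDist

def power_sketch(n_exp, n_con, p_exp, p_con, hazard_ratio, alpha=0.05):
    nd = NormalDist()
    m = n_exp * p_exp + n_con * p_con  # expected total number of events
    k = n_exp / n_con                  # allocation ratio
    return nd.cdf(sqrt(k * m) * abs(hazard_ratio - 1.0) / (k * hazard_ratio + 1.0)
                  - nd.inv_cdf(1 - alpha / 2.0))

power_sketch(50, 100, 0.25, 0.35, 0.5)  # ≈ 0.496
```

Note the symmetry with the sample-size calculation: increasing the number of expected events (larger groups or higher event probabilities), or postulating a hazard ratio further from 1, both raise the power.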

#### Problems with convergence in the Cox Proportional Hazard Model¶

Since the estimation of the coefficients in the Cox proportional hazard model is done using the Newton-Raphson algorithm, there are sometimes problems with convergence. Here are some common symptoms and possible resolutions:

1. `delta contains nan value(s). Convergence halted.`: First try adding `show_progress=True` in the `fit` call. If the values in `delta` grow unboundedly, it's possible the `step_size` is too large. Try setting it to a small value (0.1-0.5).
2. `LinAlgError: Singular matrix`: This means that there is a linear combination in your dataset. That is, a column is equal to a linear combination of one or more other columns. Try to find the relationship by looking at the correlation matrix of your dataset.
3. Some coefficients are many orders of magnitude larger than others, and the standard errors of those coefficients are equally large, __or__ there are nan's in the results. This can be seen using the `summary` method on a fitted `CoxPHFitter` object.
    - Look for a `ConvergenceWarning` about variances being too small. The dataset may contain a constant column, which provides no information for the regression (the Cox model doesn't have a traditional “intercept” term like other regression models).
    - The data may be completely separable, which means that there exists a covariate that completely determines whether an event occurred or not. For example, for all “death” events in the dataset, there exists a covariate that is constant amongst all of them. Look for a `ConvergenceWarning` after the `fit` call.
    - Related to the above, the relationship between a covariate and the duration may be completely determined. For example, if the rank correlation between a covariate and the duration is very close to 1 or -1, then the log-likelihood can be increased arbitrarily using just that covariate. Look for a `ConvergenceWarning` after the `fit` call.
    - Another problem may be a collinear relationship in your dataset. See point 2. above.
4. Adding a very small `penalizer_coef` significantly changes the results. This probably means that the step size is too large. Try decreasing it, and returning the `penalizer_coef` term to 0.
5. If using the `strata` argument, make sure your stratification group sizes are not too small. Try `df.groupby(strata).size()`.
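Two of the symptoms above, a constant column and collinear columns, can be checked directly before fitting. Here is a small pure-Python sketch (the `diagnose` helper is hypothetical; in practice, inspecting `df.corr()` in Pandas does the same job):

```
from itertools import combinations
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

def diagnose(columns):
    """columns: dict mapping column name -> list of values."""
    problems = []
    for name, values in columns.items():
        if pstdev(values) == 0:
            problems.append('%s is constant' % name)
    for a, b in combinations(columns, 2):
        if pstdev(columns[a]) > 0 and pstdev(columns[b]) > 0 \
                and abs(pearson(columns[a], columns[b])) > 0.9999:
            problems.append('%s and %s are collinear' % (a, b))
    return problems

diagnose({'x1': [1, 2, 3, 4], 'x2': [2, 4, 6, 8], 'c': [1, 1, 1, 1]})
# → ['c is constant', 'x1 and x2 are collinear']
```

Dropping (or combining) the flagged columns before calling `fit` usually resolves the `LinAlgError` and variance warnings described above.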

## Installation¶

Dependencies are from the typical Python data stack: NumPy, Pandas, SciPy, and optionally Matplotlib. Install using:

```
pip install lifelines
```

## Source code and Issue Tracker¶

Available on GitHub at CamDavidsonPilon/lifelines. Please report bugs, issues, and feature requests there. We also have a Gitter channel open to discuss lifelines.