lifelines
lifelines is an implementation of survival analysis in Python. What benefits does lifelines offer over other survival analysis implementations?
- built on top of Pandas
- internal plotting methods
- simple and intuitive API
- only focus is survival analysis
Contents:
Quickstart
Kaplan-Meier and Nelson-Aalen
Note
For readers looking for an introduction to survival analysis, it's recommended to start at Introduction to survival analysis.
Let's start by importing some data. We need the durations that individuals are observed for, and whether they "died" or not.
from lifelines.datasets import load_waltons
df = load_waltons() # returns a Pandas DataFrame
print(df.head())
"""
    T  E   group
0   6  1  miR137
1  13  1  miR137
2  13  1  miR137
3  13  1  miR137
4  19  1  miR137
"""
T = df['T']
E = df['E']
T is an array of durations, and E is either a boolean or binary array representing whether the "death" was observed (alternatively, an individual can be censored).
Note
lifelines assumes all "deaths" are observed unless otherwise specified.
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
kmf.fit(T, event_observed=E) # or, more succinctly, kmf.fit(T, E)
After calling the fit method, we have access to new properties like survival_function_ and methods like plot(). The latter is a wrapper around Pandas' internal plotting library.
kmf.survival_function_
kmf.median_
kmf.plot()
Multiple groups
groups = df['group']
ix = (groups == 'miR137')
kmf.fit(T[~ix], E[~ix], label='control')
ax = kmf.plot()
kmf.fit(T[ix], E[ix], label='miR137')
ax = kmf.plot(ax=ax)
Alternatively, for many more groups and a more "pandas-esque" approach:
from matplotlib import pyplot as plt

ax = plt.subplot(111)
kmf = KaplanMeierFitter()
for name, grouped_df in df.groupby('group'):
    kmf.fit(grouped_df["T"], grouped_df["E"], label=name)
    kmf.plot(ax=ax)
Similar functionality exists for the NelsonAalenFitter:
from lifelines import NelsonAalenFitter
naf = NelsonAalenFitter()
naf.fit(T, event_observed=E)
but instead of a survival_function_ being exposed, a cumulative_hazard_ is.
Note
Similar to scikit-learn, all statistically estimated quantities have an underscore appended to the property name.
Note
More detailed docs about estimating the survival function and cumulative hazard are available in Survival analysis with lifelines.
Getting data in the right format
Often you'll have data that looks like:
*start_time1*, *end_time1*
*start_time2*, *end_time2*
*start_time3*, None
*start_time4*, *end_time4*
lifelines has some utility functions to transform this dataset into duration and censoring vectors:
from lifelines.utils import datetimes_to_durations
# start_times is a vector or list of datetime objects or datetime strings
# end_times is a vector or list of (possibly missing) datetime objects or datetime strings
T, E = datetimes_to_durations(start_times, end_times, freq='h')
Perhaps you are interested in viewing the survival table given some durations and censoring vectors.
from lifelines.utils import survival_table_from_events
table = survival_table_from_events(T, E)
print(table.head())
"""
          removed  observed  censored  entrance  at_risk
event_at
0               0         0         0       163      163
6               1         1         0         0      163
7               2         1         1         0      162
9               3         3         0         0      160
13              3         3         0         0      157
"""
Survival regression
While the above KaplanMeierFitter and NelsonAalenFitter are useful, they only give us an "average" view of the population. Often we have specific data at the individual level, either continuous or categorical, that we would like to use. For this, we turn to survival regression, specifically AalenAdditiveFitter and CoxPHFitter.
from lifelines.datasets import load_regression_dataset
regression_dataset = load_regression_dataset()
regression_dataset.head()
The input of the fit method's API in a regression is different. All the data, including durations, censorings, and covariates must be contained in a Pandas DataFrame (yes, it must be a DataFrame). The duration column and event-occurred column must be specified in the call to fit.
from lifelines import CoxPHFitter
# Using Cox Proportional Hazards model
cph = CoxPHFitter()
cph.fit(regression_dataset, 'T', event_col='E')
cph.print_summary()
"""
<lifelines.CoxPHFitter: fitted with 200 observations, 11 censored>
duration col = 'T'
event col = 'E'
number of subjects = 200
number of events = 189
loglikelihood = 807.62
time fit was run = 20190127 23:11:22 UTC

coef exp(coef) se(coef) z p log2(p) lower 0.95 upper 0.95
var1 0.22 1.25 0.07 2.99 <0.005 8.49 0.08 0.37
var2 0.05 1.05 0.08 0.61 0.54 0.89 0.11 0.21
var3 0.22 1.24 0.08 2.88 <0.005 7.97 0.07 0.37

Concordance = 0.58
Likelihood ratio test = 15.54 on 3 df, log2(p)=9.47
"""
cph.plot()
If we focus on Aalen's Additive model,
# Using Aalen's Additive model
from lifelines import AalenAdditiveFitter
aaf = AalenAdditiveFitter(fit_intercept=False)
aaf.fit(regression_dataset, 'T', event_col='E')
Like CoxPHFitter, after fitting you'll have access to properties like cumulative_hazards_ and methods like plot, predict_cumulative_hazards, and predict_survival_function. The latter two methods require an additional argument of individual covariates:
X = regression_dataset.drop(['E', 'T'], axis=1)
aaf.predict_survival_function(X.iloc[10:12]).plot() # get the unique survival functions of two subjects
Like the above estimators, there is also a built-in plotting method:
aaf.plot()
Note
More detailed documentation and tutorials are available in Survival Regression.
Introduction to survival analysis
Applications
Traditionally, survival analysis was developed to measure lifespans of individuals. An actuary or health professional would ask questions like "how long does this population live for?", and answer it using survival analysis. For example, the population may be a nation's population (for actuaries), or a population stricken by a disease (in the medical professional's case). Traditionally, it was a rather morbid subject.
The analysis can be further applied to not just traditional births and deaths, but any duration. Medical professionals might be interested in the time between childbirths, where a birth in this case is the event of having a child, and a death is becoming pregnant again! (obviously, we are loose with our definitions of birth and death) Another example is users subscribing to a service: a birth is a user who joins the service, and a death is when the user leaves the service.
Censoring
At the time you want to make inferences about durations, it is possible, and likely true, that not all the death events have occurred yet. For example, a medical professional will not wait 50 years for each individual in the study to pass away before investigating – he or she is interested in the effectiveness of improving lifetimes after only a few years, or possibly months.
The individuals in a population who have not been subject to the death event are labeled as right-censored, i.e., we did not (or cannot) view the rest of their life history due to some external circumstances. All the information we have on these individuals is their current lifetime duration (which is naturally less than their actual lifetime).
Note
There is also left-censoring, where an individual's birth event is not seen.
A common mistake data analysts make is choosing to ignore the right-censored individuals. We will see why this is a mistake next.
Consider a case where the population is actually made up of two subpopulations, \(A\) and \(B\). Population \(A\) has a very small lifespan, say 2 months on average, and population \(B\) enjoys a much larger lifespan, say 12 months on average. We may not know this distinction beforehand. At \(t=10\), we wish to investigate the average lifespan for everyone.
In the figure below, the red lines denote the lifespans of individuals where the death event has been observed, and the blue lines denote the lifespans of the right-censored individuals (deaths have not been observed). If we are asked to estimate the average lifetime of our population, and we naively decided not to include the right-censored individuals, it is clear that we would be severely underestimating the true average lifespan.
import numpy as np
from numpy.random import uniform, exponential
from lifelines.plotting import plot_lifetimes

N = 25
CURRENT_TIME = 10
actual_lifetimes = np.array([
    exponential(12) if (uniform() < 0.5) else exponential(2) for i in range(N)
])
observed_lifetimes = np.minimum(actual_lifetimes, CURRENT_TIME)
death_observed = actual_lifetimes < CURRENT_TIME

ax = plot_lifetimes(observed_lifetimes, event_observed=death_observed)
ax.set_xlim(0, 25)
ax.vlines(10, 0, 30, lw=2, linestyles='--')
ax.set_xlabel("time")
ax.set_title("Births and deaths of our population, at $t=10$")
print("Observed lifetimes at time %d:\n" % (CURRENT_TIME), observed_lifetimes)
Observed lifetimes at time 10:
[10. 1.1 8. 10. 3.43 0.63 6.28 1.03 2.37 6.17 10.
0.21 2.71 1.25 10. 3.4 0.62 1.94 0.22 7.43 6.16 10.
9.41 10. 10.]
Furthermore, if we instead simply took the mean of all observed lifespans, including the current lifespans of right-censored instances, we would still be underestimating the true average lifespan. Below we plot the actual lifetimes of all instances (recall we do not see this information at \(t=10\)).
ax = plot_lifetimes(actual_lifetimes, event_observed=death_observed)
ax.vlines(10, 0, 30, lw=2, linestyles='--')
ax.set_xlim(0, 25)
Survival analysis was originally developed to solve this type of problem, that is, to deal with estimation when our data is right-censored. Even in the case where all events have been observed, i.e. no censoring, survival analysis is still a very useful tool to understand durations.
The observations need not always start at zero, either. This was done only for understanding in the above example. Consider the example where a customer entering a store is a birth: a customer can enter at any time, and not necessarily at time zero. In survival analysis, durations are relative: individuals may start at different times. (We actually only need the duration of the observation, and not necessarily the start and end times.)
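As a tiny illustration of this point (the dates below are hypothetical), only the elapsed durations matter, not the absolute entry times:

```python
from datetime import datetime

# Two hypothetical customers entering a store at different clock times.
entries = [datetime(2013, 5, 1, 10, 0), datetime(2013, 5, 1, 14, 30)]
exits = [datetime(2013, 5, 1, 10, 45), datetime(2013, 5, 1, 15, 0)]

# Only the elapsed duration matters, not when each observation started.
durations_min = [(e - s).total_seconds() / 60 for s, e in zip(entries, exits)]
print(durations_min)  # [45.0, 30.0]
```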
We next introduce the two fundamental objects in survival analysis, the survival function and the hazard function.
Survival function
Let \(T\) be a (possibly infinite, but always non-negative) random lifetime taken from the population under study. For example, the amount of time a couple is married. Or the time it takes a user to enter a webpage (an infinite time if they never do). The survival function \(S(t)\) of a population is defined as

\[S(t) = \Pr(T > t)\]
In plain English: the survival function defines the probability the death event has not occurred yet at time \(t\), or equivalently, the probability of surviving past time \(t\). Note the following properties of the survival function:
- \(0 \le S(t) \le 1\)
- \(F_T(t) = 1 - S(t)\), where \(F_T(t)\) is the CDF of \(T\), which implies
- \(S(t)\) is a non-increasing function of \(t\).
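These properties can be checked numerically against a concrete survival function. A sketch, assuming an exponential lifetime with mean 10 (an arbitrary illustrative choice):

```python
import numpy as np

# An illustrative survival function: exponential lifetime with mean 10.
t = np.linspace(0, 100, 1001)
S = np.exp(-t / 10)   # S(t) = Pr(T > t)
F = 1 - S             # F_T(t), the CDF of T

assert np.all((0 <= S) & (S <= 1))   # 0 <= S(t) <= 1
assert np.all(np.diff(S) <= 0)       # S(t) is non-increasing
assert np.allclose(F + S, 1)         # F_T(t) = 1 - S(t)
```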
Hazard curve
We are also interested in the probability of the death event occurring at time \(t\), given that the death event has not occurred until time \(t\). Mathematically, that is:

\[\Pr(t \le T \le t + \delta t \mid T > t)\]

This quantity goes to 0 as \(\delta t\) shrinks, so we divide this by the interval \(\delta t\) (like we might do in calculus). This defines the hazard function at time \(t\), \(h(t)\):

\[h(t) = \lim_{\delta t \to 0} \frac{\Pr(t \le T \le t + \delta t \mid T > t)}{\delta t}\]

It can be shown that this is equal to:

\[h(t) = \frac{-S'(t)}{S(t)}\]

and solving this differential equation (cool, it is a differential equation!), we get:

\[S(t) = \exp\left(-\int_0^t h(z)\, dz\right)\]

What I love about the above equation is that it defines all survival functions. Notice that we can now speak either about the survival function, \(S(t)\), or the hazard function, \(h(t)\), and we can convert back and forth quite easily. Upon differentiation and some algebra, we also recover the equivalent expression

\[h(t) = -\frac{d}{dt} \log S(t)\]
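This back-and-forth can be sanity-checked numerically. A sketch assuming a constant hazard of 0.1, so that \(S(t) = \exp(-0.1t)\):

```python
import numpy as np

# Constant hazard h(t) = 0.1 implies S(t) = exp(-0.1 t).
t = np.linspace(0, 50, 5001)
S = np.exp(-0.1 * t)

# Recover the hazard as h(t) = -d/dt log S(t) via numerical differentiation.
h = -np.gradient(np.log(S), t)

assert np.allclose(h, 0.1)  # the constant hazard is recovered
```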
Of course, we do not observe the true survival curve of a population. We must use the observed data to estimate it. There are many ways to estimate the survival function and the hazard rate, which brings us to estimation using lifelines.
Survival analysis with lifelines
In the previous section, we introduced the use of survival analysis, the need for it, and the mathematical objects on which it relies. In this article, we will work with real data and the lifelines library to estimate these mathematical objects.
Estimating the survival function using Kaplan-Meier
For this example, we will be investigating the lifetimes of political leaders around the world. A political leader, in this case, is defined by a single individual's time in office as head of the ruling regime. This political leader could be an elected president, unelected dictator, monarch, etc. The birth event is the start of the individual's tenure, and the death event is the retirement of the individual. Censoring can occur if they are a) still in office at the time of dataset compilation (2008), or b) die while in power (this includes assassinations).
For example, the Bush regime began in 2000 and officially ended in 2008 upon his retirement, thus this regime's lifespan was eight years, and there was a "death" event observed. On the other hand, the JFK regime lasted 2 years, from 1961 to 1963, and the regime's official death event was not observed – JFK died before his official retirement.
(This is an example that has gladly redefined the birth and death events, and in fact completely flips the idea upside down by using deaths as the censoring event. This is also an example where the current time is not the only cause of censoring; there are alternative events (e.g., death in office) that can be the cause of censoring.)
To estimate the survival function, we first will use the Kaplan-Meier estimate, defined:

\[\hat{S}(t) = \prod_{t_i < t} \frac{n_i - d_i}{n_i}\]

where \(d_i\) is the number of death events at time \(t_i\) and \(n_i\) is the number of subjects at risk of death just prior to time \(t_i\).
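To make the product concrete, here is a hand-rolled sketch of the estimator (pure NumPy, for illustration only; in practice, use lifelines' implementation):

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Hand-rolled Kaplan-Meier: multiply (n_i - d_i) / n_i over event times."""
    durations = np.asarray(durations)
    observed = np.asarray(observed, dtype=bool)
    surv, s = {}, 1.0
    for t_i in np.unique(durations[observed]):
        n_i = int((durations >= t_i).sum())               # at risk just prior to t_i
        d_i = int(((durations == t_i) & observed).sum())  # deaths at t_i
        s *= (n_i - d_i) / n_i
        surv[float(t_i)] = s
    return surv

# Tiny worked example: deaths at t=1 and t=2, one subject censored at t=3.
# At t=1: 2/3 survive; at t=2: 2/3 * 1/2 = 1/3.
print(kaplan_meier([1, 2, 3], [1, 1, 0]))
```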
Let's bring in our dataset.
from lifelines.datasets import load_dd
data = load_dd()
data.head()
ctryname | cowcode2 | politycode | un_region_name | un_continent_name | ehead | leaderspellreg | democracy | regime | start_year | duration | observed
Afghanistan | 700 | 700 | Southern Asia | Asia | Mohammad Zahir Shah | Mohammad Zahir Shah.Afghanistan.1946.1952.Monarchy | Non-democracy | Monarchy | 1946 | 7 | 1
Afghanistan | 700 | 700 | Southern Asia | Asia | Sardar Mohammad Daoud | Sardar Mohammad Daoud.Afghanistan.1953.1962.Civilian Dict | Non-democracy | Civilian Dict | 1953 | 10 | 1
Afghanistan | 700 | 700 | Southern Asia | Asia | Mohammad Zahir Shah | Mohammad Zahir Shah.Afghanistan.1963.1972.Monarchy | Non-democracy | Monarchy | 1963 | 10 | 1
Afghanistan | 700 | 700 | Southern Asia | Asia | Sardar Mohammad Daoud | Sardar Mohammad Daoud.Afghanistan.1973.1977.Civilian Dict | Non-democracy | Civilian Dict | 1973 | 5 | 0
Afghanistan | 700 | 700 | Southern Asia | Asia | Nur Mohammad Taraki | Nur Mohammad Taraki.Afghanistan.1978.1978.Civilian Dict | Non-democracy | Civilian Dict | 1978 | 1 | 0
From the lifelines library, we'll need the KaplanMeierFitter for this exercise:
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
Note
Other ways to estimate the survival function in lifelines are discussed below.
For this estimation, we need the duration each leader was/has been in office, and whether or not they were observed to have left office (leaders who died in office, or were still in office in 2008, the latest date this data was recorded at, do not have observed death events).
We next use the KaplanMeierFitter method fit to fit the model to the data. (This is similar to, and inspired by, scikit-learn's fit/predict API.) Below we fit our data with the KaplanMeierFitter:
T = data["duration"]
E = data["observed"]
kmf.fit(T, event_observed=E)
<lifelines.KaplanMeierFitter: fitted with 1808 observations, 340 censored>
After calling the fit method, the KaplanMeierFitter has a property called survival_function_ (again, we follow the styling of scikit-learn, and append an underscore to all properties that were computationally estimated). The property is a Pandas DataFrame, so we can call plot on it:
kmf.survival_function_.plot()
plt.title('Survival function of political regimes');
How do we interpret this? The y-axis represents the probability a leader is still around after \(t\) years, where \(t\) years is on the x-axis. We see that very few leaders make it past 20 years in office. Of course, like all good stats, we need to report how uncertain we are about these point estimates, i.e., we need confidence intervals. They are computed in the call to fit, and located under the confidence_interval_ property. (The method uses the exponential Greenwood confidence interval. The mathematics can be found in these notes.)
Alternatively, we can call plot on the KaplanMeierFitter itself to plot both the KM estimate and its confidence intervals:
kmf.plot()
The median time in office, which defines the point in time where on average 1/2 of the population has expired, is a property:
kmf.median_
# 4.0
Interesting that it is only four years. That means, around the world, elected leaders have a 50% chance of cessation in four years or less!
Let's segment on democratic regimes vs non-democratic regimes. Calling plot on either the estimate itself or the fitter object will return an axis object that can be used for plotting further estimates:
ax = plt.subplot(111)
dem = (data["democracy"] == "Democracy")
kmf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
kmf.plot(ax=ax)
kmf.fit(T[~dem], event_observed=E[~dem], label="Non-democratic Regimes")
kmf.plot(ax=ax)
plt.ylim(0, 1);
plt.title("Lifespans of different global regimes");
We might be interested in estimating the probabilities in between some points. We can do that with the timeline argument. We specify the times we are interested in, and are returned a DataFrame with the probabilities of survival at those points:
ax = plt.subplot(111)
t = np.linspace(0, 50, 51)
kmf.fit(T[dem], event_observed=E[dem], timeline=t, label="Democratic Regimes")
ax = kmf.plot(ax=ax)
print("Median survival time of democratic:", kmf.median_)
kmf.fit(T[~dem], event_observed=E[~dem], timeline=t, label="Non-democratic Regimes")
ax = kmf.plot(ax=ax)
print("Median survival time of non-democratic:", kmf.median_)
plt.ylim(0, 1)
plt.title("Lifespans of different global regimes");
Median survival time of democratic: Democratic Regimes    3
dtype: float64
Median survival time of non-democratic: Non-democratic Regimes    6
dtype: float64
It is incredible how much longer these non-democratic regimes exist for. A democratic regime does have a natural bias towards death, though: both via elections and natural limits (the US imposes a strict eight-year limit). The median of a non-democratic regime is only about twice as large as that of a democratic regime, but the difference is apparent in the tails: if you're a non-democratic leader, and you've made it past the 10-year mark, you probably have a long life ahead. Meanwhile, a democratic leader rarely makes it past ten years, and has a very short lifetime past that.
Here the difference between survival functions is very obvious, and performing a statistical test seems pedantic. If the curves were more similar, or we possessed less data, we might be interested in performing a statistical test. In this case, lifelines contains routines in lifelines.statistics to compare two survival curves. Below we demonstrate this routine. The function logrank_test is a common statistical test in survival analysis that compares two event series' generators. If the value returned exceeds some pre-specified value, then we rule that the series have different generators.
from lifelines.statistics import logrank_test
results = logrank_test(T[dem], T[~dem], E[dem], E[~dem], alpha=.99)
results.print_summary()
<lifelines.StatisticalResult>
               t_0 = -1
 null_distribution = chi squared
degrees_of_freedom = 1
             alpha = 0.99

 test_statistic       p  -log2(p)
         260.47  <0.005    192.23
Let's compare the different types of regimes present in the dataset:
regime_types = data['regime'].unique()

for i, regime_type in enumerate(regime_types):
    ax = plt.subplot(2, 3, i + 1)
    ix = data['regime'] == regime_type
    kmf.fit(T[ix], E[ix], label=regime_type)
    kmf.plot(ax=ax, legend=False)
    plt.title(regime_type)
    plt.xlim(0, 50)
    if i == 0:
        plt.ylabel('Frac. in power after $n$ years')

plt.tight_layout()
There are alternative (and sometimes better) tests of survival curves, and we explain more here: Statistically compare two populations
Getting data into the right format
lifelines' data format is consistent across all estimator classes and functions: an array of individual durations, and the individuals' event observations (if any). These are often denoted T and E respectively. For example:
T = [0,3,3,2,1,2]
E = [1,1,0,0,1,1]
kmf.fit(T, event_observed=E)
The raw data is not always available in this format – lifelines includes some helper functions to transform data formats into the lifelines format. These are located in the lifelines.utils sub-library. For example, the function datetimes_to_durations accepts an array or Pandas object of start times/dates, and an array or Pandas object of end times/dates (or None if not observed):
from lifelines.utils import datetimes_to_durations
start_date = ['2013-10-10 0:00:00', '2013-10-09', '2013-10-10']
end_date = ['2013-10-13', '2013-10-10', None]
T, E = datetimes_to_durations(start_date, end_date, fill_date='2013-10-15')
print('T (durations): ', T)
print('E (event_observed): ', E)
T (durations): [ 3. 1. 5.]
E (event_observed): [ True True False]
The function datetimes_to_durations is very flexible, and has many keywords to tinker with.
Estimating hazard rates using Nelson-Aalen
The survival curve is a great way to summarize and visualize the survival dataset; however, it is not the only way. If we are curious about the hazard function \(h(t)\) of a population, we unfortunately cannot transform the Kaplan-Meier estimate – statistics doesn't work quite that well. Fortunately, there is a proper non-parametric estimator of the cumulative hazard function:

\[H(t) = \int_0^t h(z)\, dz\]

The estimator for this quantity is called the Nelson-Aalen estimator:

\[\hat{H}(t) = \sum_{t_i \le t} \frac{d_i}{n_i}\]

where \(d_i\) is the number of deaths at time \(t_i\) and \(n_i\) is the number of susceptible individuals.
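The sum can be sketched by hand the same way (pure NumPy, for illustration only; this is not lifelines' implementation):

```python
import numpy as np

def nelson_aalen(durations, observed):
    """Hand-rolled Nelson-Aalen: accumulate d_i / n_i over event times."""
    durations = np.asarray(durations)
    observed = np.asarray(observed, dtype=bool)
    cumhaz, H = {}, 0.0
    for t_i in np.unique(durations[observed]):
        n_i = int((durations >= t_i).sum())               # number at risk at t_i
        d_i = int(((durations == t_i) & observed).sum())  # deaths at t_i
        H += d_i / n_i
        cumhaz[float(t_i)] = H
    return cumhaz

# Deaths at t=1 and t=2, one subject censored at t=3:
# H(1) = 1/3, H(2) = 1/3 + 1/2 = 5/6.
print(nelson_aalen([1, 2, 3], [1, 1, 0]))
```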
In lifelines, this estimator is available as the NelsonAalenFitter. Let's use the regime dataset from above:
T = data["duration"]
E = data["observed"]
from lifelines import NelsonAalenFitter
naf = NelsonAalenFitter()
naf.fit(T, event_observed=E)
After fitting, the class exposes the property cumulative_hazard_ as a DataFrame:
print(naf.cumulative_hazard_.head())
naf.plot()
   NA_estimate
0     0.000000
1     0.325912
2     0.507356
3     0.671251
4     0.869867

[5 rows x 1 columns]
The cumulative hazard is less immediately interpretable than the survival curve, but the hazard curve is the basis of more advanced techniques in survival analysis. Recall that we are estimating the cumulative hazard curve, \(H(t)\). (Why? The sum of estimates is much more stable than the point-wise estimates.) Thus we know the rate of change of this curve is an estimate of the hazard function.
Looking at the figure above, it looks like the hazard starts off high and gets smaller (as seen by the decreasing rate of change). Let's break the regimes down between democratic and non-democratic, during the first 20 years:
Note
We are using the loc argument in the call to plot here: it accepts a slice and plots only points within that slice.
naf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
ax = naf.plot(loc=slice(0, 20))
naf.fit(T[~dem], event_observed=E[~dem], label="Non-democratic Regimes")
naf.plot(ax=ax, loc=slice(0, 20))
plt.title("Cumulative hazard function of different global regimes");
Looking at the rates of change, I would say that both political philosophies have a constant hazard, albeit democratic regimes have a much higher constant hazard. So why did the combination of both regimes have a decreasing hazard? This is the effect of frailty.
Smoothing the hazard curveÂ¶
Interpretation of the cumulative hazard function can be difficult – it is not how we usually interpret functions. (On the other hand, most survival analysis is done using the cumulative hazard function, so understanding it is recommended.)
Alternatively, we can derive the more interpretable hazard curve, but there is a catch. The derivation involves a kernel smoother (to smooth out the differences of the cumulative hazard curve), and this requires us to specify a bandwidth parameter that controls the amount of smoothing. This functionality is in the smoothed_hazard_ and hazard_confidence_intervals_ methods. (Why methods? They require an argument representing the bandwidth.)
There is also a plot_hazard function (that also requires a bandwidth keyword) that will plot the estimate plus the confidence intervals, similar to the traditional plot functionality.
bandwidth = 3.
naf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
ax = naf.plot_hazard(bandwidth=bandwidth)
naf.fit(T[~dem], event_observed=E[~dem], label="Non-democratic Regimes")
naf.plot_hazard(ax=ax, bandwidth=bandwidth)
plt.title("Hazard function of different global regimes - bandwidth=%.1f" % bandwidth);
plt.ylim(0, 0.4)
plt.xlim(0, 25);
It is clearer here which group has the higher hazard, and, as hypothesized above, both hazard rates are close to being constant.
There is no obvious way to choose a bandwidth, and different bandwidths produce different inferences, so itâ€™s best to be very careful here. (My advice: stick with the cumulative hazard function.)
bandwidth = 8.0
naf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
ax = naf.plot_hazard(bandwidth=bandwidth)
naf.fit(T[~dem], event_observed=E[~dem], label="Non-democratic Regimes")
naf.plot_hazard(ax=ax, bandwidth=bandwidth)
plt.title("Hazard function of different global regimes - bandwidth=%.1f" % bandwidth);
Estimating cumulative hazards using parametric models
Fitting to a Weibull model
Another very popular model for survival data is the Weibull model. In contrast to the Nelson-Aalen estimator, this model is a parametric model, meaning it has a functional form with parameters that we fit the data to. (The Nelson-Aalen estimator has no parameters to fit.) Mathematically, the survival function looks like:

\[S(t) = \exp\left(-(\lambda t)^\rho\right), \quad \lambda > 0, \rho > 0\]

A priori, we do not know what \(\lambda\) and \(\rho\) are, but we use the data on hand to estimate these parameters. We model and estimate the cumulative hazard rate instead of the survival function (this is different from the Kaplan-Meier estimator):

\[H(t) = (\lambda t)^\rho\]
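As a numeric sketch of this parameterization (the \((\lambda t)^\rho\) form, with illustrative parameter values close to the fit shown further below):

```python
import numpy as np

# Illustrative Weibull parameters (close to the fit shown below).
lam, rho = 0.02, 3.45

t = np.linspace(0, 80, 801)
H = (lam * t) ** rho   # cumulative hazard H(t) = (lambda * t) ** rho
S = np.exp(-H)         # survival function S(t) = exp(-H(t))

assert S[0] == 1.0               # everyone is alive at t = 0
assert np.all(np.diff(H) >= 0)   # a cumulative hazard never decreases
assert np.all(np.diff(S) <= 0)   # so the survival function never increases
```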
In lifelines, estimation is available using the WeibullFitter class. The plot method will plot the cumulative hazard.
from lifelines import WeibullFitter
from lifelines.datasets import load_waltons
data = load_waltons()
T = data['T']
E = data['E']
wf = WeibullFitter()
wf.fit(T, E)
wf.print_summary()
wf.plot()
<lifelines.WeibullFitter: fitted with 163 observations, 7 censored>
number of subjects = 163
  number of events = 156
    log-likelihood = -672.062
        hypothesis = lambda != 1, rho != 1

         coef  se(coef)  lower 0.95  upper 0.95       p  -log2(p)
lambda_  0.02      0.00        0.02        0.02  <0.005       inf
rho_     3.45      0.24        2.97        3.93  <0.005     76.83
Other parametric models: Exponential, Log-Logistic & Log-Normal
Similarly, there are other parametric models in lifelines. Generally, which parametric model to choose is determined either by knowledge of the distribution of durations, or by some sort of model goodness-of-fit. Below are the built-in parametric models, and the Nelson-Aalen non-parametric model, of the same data.
import matplotlib.pyplot as plt
from lifelines import WeibullFitter
from lifelines import ExponentialFitter
from lifelines import LogNormalFitter
from lifelines import LogLogisticFitter
from lifelines import NelsonAalenFitter
from lifelines import PiecewiseExponentialFitter
from lifelines.datasets import load_waltons
data = load_waltons()
fig, axes = plt.subplots(2, 3, figsize=(9, 5))
T = data['T']
E = data['E']
wbf = WeibullFitter().fit(T, E, label='WeibullFitter')
exf = ExponentialFitter().fit(T, E, label='ExponentialFitter')
lnf = LogNormalFitter().fit(T, E, label='LogNormalFitter')
naf = NelsonAalenFitter().fit(T, E, label='NelsonAalenFitter')
llf = LogLogisticFitter().fit(T, E, label='LogLogisticFitter')
pwf = PiecewiseExponentialFitter([40, 60]).fit(T, E, label='PiecewiseExponentialFitter')
wbf.plot_cumulative_hazard(ax=axes[0][0])
exf.plot_cumulative_hazard(ax=axes[0][1])
lnf.plot_cumulative_hazard(ax=axes[0][2])
naf.plot_cumulative_hazard(ax=axes[1][0])
llf.plot_cumulative_hazard(ax=axes[1][1])
pwf.plot_cumulative_hazard(ax=axes[1][2])
lifelines can also be used to define your own parametric model. There is a tutorial available on this; see Piecewise Exponential Models and Creating Custom Models.
Parametric models can also be used to create and plot the survival function. Below we compare the parametric models to the non-parametric Kaplan-Meier estimate:
from lifelines import KaplanMeierFitter
fig, axes = plt.subplots(2, 3, figsize=(9, 5))
T = data['T']
E = data['E']
kmf = KaplanMeierFitter().fit(T, E, label='KaplanMeierFitter')
wbf = WeibullFitter().fit(T, E, label='WeibullFitter')
exf = ExponentialFitter().fit(T, E, label='ExponentialFitter')
lnf = LogNormalFitter().fit(T, E, label='LogNormalFitter')
llf = LogLogisticFitter().fit(T, E, label='LogLogisticFitter')
pwf = PiecewiseExponentialFitter([40, 60]).fit(T, E, label='PiecewiseExponentialFitter')
wbf.plot_survival_function(ax=axes[0][0])
exf.plot_survival_function(ax=axes[0][1])
lnf.plot_survival_function(ax=axes[0][2])
kmf.plot_survival_function(ax=axes[1][0])
llf.plot_survival_function(ax=axes[1][1])
pwf.plot_survival_function(ax=axes[1][2])
Other types of censoring
Left-censored data
We've mainly been focusing on right-censoring, which describes cases where we do not observe the death event. This situation is the most common one. Alternatively, there are situations where we do not observe the birth event occurring. Consider the case where a doctor sees a delayed onset of symptoms of an underlying disease. The doctor is unsure when the disease was contracted (birth), but knows it was before the discovery.
Another situation where we have left-censored data is when measurements have only an upper bound, that is, the measurement instruments could only detect that the measurement was less than some upper bound.
lifelines has support for left-censored datasets in the KaplanMeierFitter class, by adding the keyword left_censoring=True (default False) to the call to fit.
from lifelines.datasets import load_lcd
lcd_dataset = load_lcd()
ix = lcd_dataset['group'] == 'alluvial_fan'
T = lcd_dataset[ix]['T']
E = lcd_dataset[ix]['E']  # boolean array, True if observed.
kmf = KaplanMeierFitter()
kmf.fit(T, E, left_censoring=True)
Instead of producing a survival function, for left-censored data we are more interested in the cumulative density function of the time to birth. This is available as the cumulative_density_ property after fitting the data.
print(kmf.cumulative_density_)
kmf.plot()  # will plot the CDF
Left-truncated data
Note
Not to be confused with left-censoring, which is also supported in KaplanMeierFitter.
Another form of bias that can be introduced into a dataset is called left-truncation. Left-truncation can occur in many situations. One situation is when individuals may have the opportunity to die before entering into the study. For example, if you are measuring time to death of prisoners in prison, the prisoners will enter the study at different ages. Some theoretical individuals who would have entered your study (i.e., went to prison) have already died.
Another situation with left-truncation occurs when subjects are exposed before entry into the study. For example, a study of time to all-cause mortality of AIDS patients might recruit individuals previously diagnosed with AIDS, possibly years before.
All univariate fitters, like KaplanMeierFitter and any parametric models, have an optional entry argument, which is an array of equal size to the duration array. It describes the time between "birth" (or "exposure") and entering the study.
Note
Nothing changes in the duration array: it still measures time from "birth" to the time the subject exited the study (either by death or censoring). That is, the duration refers to the absolute death time rather than a duration relative to study entry.
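A small sketch of constructing such an entry array (the ages below are hypothetical):

```python
# Hypothetical prisoners: age when imprisoned ("birth"/exposure), age when
# enrolled in the study, and age when they exited the study.
age_at_exposure = [20, 35, 41]
age_at_enrollment = [25, 35, 50]
age_at_exit = [30, 40, 52]

# entry = delay between exposure and joining the study;
# durations still measure time from exposure to exit.
entry = [s - b for b, s in zip(age_at_exposure, age_at_enrollment)]
T = [e - b for b, e in zip(age_at_exposure, age_at_exit)]
print(entry, T)  # [5, 0, 9] [10, 5, 11]
# These could then be passed as, e.g., kmf.fit(T, E, entry=entry)
```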
Note
Other types of censoring, like interval-censoring, are not implemented in lifelines yet.
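To make the entry bookkeeping concrete, here is a minimal sketch (plain NumPy, hypothetical data) of how delayed entry changes the risk set: a subject only contributes at time t if they had already entered the study and had not yet exited.

```python
import numpy as np

# hypothetical left-truncated data: both arrays measured from "birth"
entry = np.array([0, 2, 5, 1])  # time each subject entered the study
T     = np.array([6, 9, 8, 3])  # time each subject exited (death or censoring)

def at_risk(t, entry, T):
    # with left truncation, a subject is in the risk set at time t only if
    # they had already entered the study and had not yet exited
    return int(np.sum((entry < t) & (t <= T)))

at_risk(4, entry, T)
```

Ignoring entry (treating every subject as at risk from time 0) would inflate the early risk sets and bias the survival estimate upward.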
Survival regression
Often we have additional data aside from the duration, and, if applicable, any censorings that occurred. In the previous section's regime dataset, we have the type of government the political leader was part of, the country they were head of, and the year they were elected. Can we use this data in survival analysis?
Yes, the technique is called survival regression: the name implies we regress covariates (e.g., age, country, etc.) against another variable, in this case durations. Similar to the logic in the first part of this tutorial, we cannot use traditional methods like linear regression.
There are two popular techniques in survival regression: Cox's model and Aalen's additive model. Both models attempt to represent the hazard rate \(h(t | x)\) as a function of \(t\) and some covariates \(x\). We explore these models next.
Cox's proportional hazard model
lifelines has an implementation of the Cox proportional hazards regression model (implemented in R as coxph). The idea behind the model is that the log-hazard of an individual is a linear function of their static covariates, plus a population-level baseline hazard that changes over time. Mathematically:

\(h(t | x) = b_0(t) \exp\left(\sum_{i=1}^n \beta_i x_i\right)\)

Note a few facts about this model: the only time component is in the baseline hazard, \(b_0(t)\). In the above product, the partial hazard is a time-invariant scalar factor that only increases or decreases the baseline hazard. Thus a change in covariates will only increase or decrease the baseline hazard.
The dataset for regression
The dataset required for survival regression must be in the format of a Pandas DataFrame. Each row of the DataFrame should be an observation. There should be a column denoting the durations of the observations, and there may be a column denoting the event status of each observation (1 if the event occurred, 0 if censored). Other columns hold the additional covariates you wish to regress against. Optionally, there could be columns in the DataFrame that are used for stratification, weights, and clusters, which will be discussed later in this tutorial.
Note
In other regression models, a column of 1s might be added to represent the intercept or baseline. This is not necessary in the Cox model. In fact, there is no intercept in the Cox model: the baseline hazard represents this. lifelines will throw warnings and may experience convergence errors if a column of 1s is present in your dataset.
An example dataset is called the Rossi recidivism dataset, available in lifelines as datasets.load_rossi
.
from lifelines.datasets import load_rossi
rossi = load_rossi()
"""
week arrest fin age race wexp mar paro prio
0 20 1 0 27 1 0 0 1 3
1 17 1 0 18 1 0 0 1 8
2 25 1 0 19 0 1 0 1 13
3 52 0 1 23 1 1 1 1 1
"""
The dataframe rossi contains 432 observations. The week column is the duration, the arrest column indicates whether the event (an arrest) occurred, and the other columns represent variables we wish to regress against.
If you need to first clean or transform your dataset (encode categorical variables, add interaction terms, etc.), that should happen before using lifelines. Libraries like Pandas and Patsy help with that.
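For example, here is a minimal sketch (with hypothetical column names) of encoding a categorical column with pandas before handing the frame to lifelines. Dropping the first dummy level avoids the column-of-1s collinearity issue noted above:

```python
import pandas as pd

# hypothetical raw data with a categorical covariate
df = pd.DataFrame({
    'T': [5, 8, 3],
    'E': [1, 0, 1],
    'group': ['a', 'b', 'a'],
})

# one-hot encode; drop_first=True removes one redundant level, which
# together with the Cox model's baseline hazard would be collinear
df = pd.get_dummies(df, columns=['group'], drop_first=True)
```

The resulting frame contains only numeric columns (T, E, group_b) and can be passed directly to a fitter.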
Running the regression
The implementation of the Cox model in lifelines is called CoxPHFitter
. Like R, it has a print_summary
function that prints a tabular view of coefficients and related stats.
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi
rossi_dataset = load_rossi()
cph = CoxPHFitter()
cph.fit(rossi_dataset, duration_col='week', event_col='arrest', show_progress=True)
cph.print_summary() # access the results using cph.summary
"""
<lifelines.CoxPHFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
number of subjects = 432
number of events = 114
loglikelihood = 658.75
time fit was run = 20190127 23:10:15 UTC

coef exp(coef) se(coef) z p log2(p) lower 0.95 upper 0.95
fin 0.38 0.68 0.19 1.98 0.05 4.40 0.75 0.00
age 0.06 0.94 0.02 2.61 0.01 6.79 0.10 0.01
race 0.31 1.37 0.31 1.02 0.31 1.70 0.29 0.92
wexp 0.15 0.86 0.21 0.71 0.48 1.06 0.57 0.27
mar 0.43 0.65 0.38 1.14 0.26 1.97 1.18 0.31
paro 0.08 0.92 0.20 0.43 0.66 0.59 0.47 0.30
prio 0.09 1.10 0.03 3.19 <0.005 9.48 0.04 0.15

Concordance = 0.64
Likelihood ratio test = 33.27 on 7 df, log2(p)=15.37
"""
To access the coefficients and the baseline hazard directly, you can use cph.hazards_
and cph.baseline_hazard_
respectively.
Convergence
Fitting the Cox model to the data involves using gradient descent. lifelines takes extra effort to help with convergence, so please be attentive to any warnings that appear. Fixing any warnings will generally help convergence. If you wish to watch the fitting, there is a show_progress parameter in the CoxPHFitter.fit function. For further help, see Problems with convergence in the Cox proportional hazard model.
After fitting, the value of the maximum log-likelihood is available under cph._log_likelihood. Similarly, the score vector and Hessian matrix are available under _score_ and _hessian_ respectively.
Goodness of fit
After fitting, you may want to know how "good" a fit your model was to the data. Aside from traditional approaches, two methods the author has found useful are to 1. look at the concordance-index (see the section below on Model selection in survival regression), available as cph.score_ or in the print_summary output, and 2. compare the spread between the baseline survival function and the Kaplan-Meier survival function. (Why? Interpret the spread as how much "variance" is provided by the baseline hazard versus the partial hazard. The baseline hazard is approximately equal to the Kaplan-Meier curve if none of the variance is explained by the covariates / partial hazard. Deviations from this provide a visual measure of variance explained.) For example, the first figure below is a good fit, and the second figure is a much weaker fit.
Prediction
After fitting, you can use the suite of prediction methods: .predict_partial_hazard, .predict_survival_function, etc.
X = rossi_dataset.drop(["week", "arrest"], axis=1)
cph.predict_partial_hazard(X)
cph.predict_survival_function(X, times=[5., 25., 50.])
cph.predict_median(X)
A common use case is to predict the event time of censored subjects. This is easy to do, but we first have to calculate an important conditional probability. Let \(T\) be the (random) event time for some subject, and \(S(t) := P(T > t)\) be their survival function. We are interested in the new survival function, given we know the subject has lived past time \(s\), where \(s < t\). Mathematically:

\(S(t \mid T > s) = \frac{P(T > t)}{P(T > s)} = \frac{S(t)}{S(s)}\)
Thus we scale the original survival function by the survival function at time \(s\) (everything prior to \(s\) should be mapped to 1.0 as well, since we are working with probabilities and we know that the subject was alive before \(s\)).
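The scaling step can be sketched in plain pandas, independent of lifelines, with a hypothetical survival function:

```python
import pandas as pd

# hypothetical unconditional survival function S(t), indexed by time t
sf = pd.Series([1.0, 0.9, 0.7, 0.5, 0.3], index=[0, 1, 2, 3, 4])

s = 2  # we know the subject was still alive at time s

# P(T > t | T > s) = S(t) / S(s); values before s are clipped to 1.0,
# since the subject was certainly alive prior to s
conditional_sf = (sf / sf.loc[s]).clip(upper=1.0)
```

The conditioned curve equals 1.0 up to time s and then decays proportionally to the original curve.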
Back to our original problem of predicting the event time of censored individuals, we do the same thing:
from lifelines import CoxPHFitter
from lifelines.datasets import load_regression_dataset
df = load_regression_dataset()
cph = CoxPHFitter().fit(df, 'T', 'E')
censored_subjects = df.loc[df['E'] == 0]
unconditioned_sf = cph.predict_survival_function(censored_subjects)
conditioned_sf = unconditioned_sf.apply(lambda c: (c / c.loc[df.loc[c.name, 'T']]).clip(upper=1))
# let's focus on a single subject
subject = 13
unconditioned_sf[subject].plot(ls="--", color="#A60628", label="unconditioned")
conditioned_sf[subject].plot(color="#A60628", label="conditioned on $T>10$")
plt.legend()
From here, you can pick a median or percentile as a best guess as to the subjectâ€™s event time:
from lifelines.utils import median_survival_times, qth_survival_times
predictions_50 = median_survival_times(conditioned_sf)
predictions_75 = qth_survival_times(0.75, conditioned_sf)
# plotting subject 13 again
plt.hlines([0.5, 0.75], 0, 23, alpha=0.5, label="percentiles")
plt.scatter(median_survival_times(conditioned_sf[subject]), 0.5, color="#E24A33", label="median prediction", zorder=20)
plt.scatter(qth_survival_times(0.75, conditioned_sf[subject]), 0.75, color="#467821", label="q=75 prediction", zorder=20)
plt.legend()
Plotting the coefficients
With a fitted model, an alternative way to view the coefficients and their ranges is to use the plot
method.
from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter
rossi_dataset = load_rossi()
cph = CoxPHFitter()
cph.fit(rossi_dataset, duration_col='week', event_col='arrest', show_progress=True)
cph.plot()
Plotting the effect of varying a covariate
After fitting, we can plot what the survival curves look like as we vary a single covariate while holding everything else equal. This is useful to understand the impact of a covariate, given the model. To do this, we use the plot_covariate_groups method and give it the covariate of interest, and the values to display.
from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter
rossi_dataset = load_rossi()
cph = CoxPHFitter()
cph.fit(rossi_dataset, duration_col='week', event_col='arrest', show_progress=True)
cph.plot_covariate_groups('prio', [0, 5, 10, 15])
Checking the proportional hazards assumption
CoxPHFitter
has a check_assumptions
method that will output violations of the proportional hazard assumption. For a tutorial on how to fix violations, see Testing the Proportional Hazard Assumptions.
Non-proportional hazards is a case of model misspecification. Suggestions are to look for ways to stratify a column (see docs below), or use a time-varying model (see docs much further below).
Stratification
Sometimes one or more covariates may not obey the proportional hazard assumption. In this case, we can allow the covariate(s) to still be included in the model without estimating their effect. This is called stratification. At a high level, think of it as splitting the dataset into N smaller datasets, defined by the unique values of the stratifying covariate(s). Each dataset has its own baseline hazard (the non-parametric part of the model), but they all share the regression parameters (the parametric part of the model). Since the stratified covariates are constant within each dataset, there is no regression parameter for them, hence they will not show up in the output. However, there will be N baseline hazards under baseline_cumulative_hazard_.
To specify variables to be used in stratification, we define them in the call to fit
:
from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter
rossi_dataset = load_rossi()
cph = CoxPHFitter()
cph.fit(rossi_dataset, 'week', event_col='arrest', strata=['race'], show_progress=True)
cph.print_summary() # access the results using cph.summary
"""
<lifelines.CoxPHFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
strata = ['race']
number of subjects = 432
number of events = 114
loglikelihood = 620.56
time fit was run = 20190127 23:08:35 UTC

coef exp(coef) se(coef) z p log2(p) lower 0.95 upper 0.95
fin 0.38 0.68 0.19 1.98 0.05 4.39 0.75 0.00
age 0.06 0.94 0.02 2.62 0.01 6.83 0.10 0.01
wexp 0.14 0.87 0.21 0.67 0.50 0.99 0.56 0.27
mar 0.44 0.64 0.38 1.15 0.25 2.00 1.19 0.31
paro 0.09 0.92 0.20 0.44 0.66 0.60 0.47 0.30
prio 0.09 1.10 0.03 3.21 <0.005 9.56 0.04 0.15

Concordance = 0.64
Likelihood ratio test = 109.63 on 6 df, log2(p)=68.48
"""
cph.baseline_cumulative_hazard_.shape
# (49, 2)
Weights & robust errors
Observations can come with weights, as well. These weights may be integer values representing some commonly occurring observation, or they may be float values representing sampling weights (e.g., inverse probability weights). The CoxPHFitter.fit method has a keyword argument for specifying which column in the dataframe should be used as weights, e.g., CoxPHFitter().fit(df, 'T', 'E', weights_col='weights').
When using sampling weights, it's correct to also change the standard error calculations. That is done by turning on the robust flag in fit. Internally, CoxPHFitter will use the sandwich estimator to compute the errors.
import pandas as pd
from lifelines import CoxPHFitter
df = pd.DataFrame({
'T': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
'E': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
'weights': [1.1, 0.5, 2.0, 1.6, 1.2, 4.3, 1.4, 4.5, 3.0, 3.2, 0.4, 6.2],
'month': [10, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
'age': [4, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
})
cph = CoxPHFitter()
cph.fit(df, 'T', 'E', weights_col='weights', robust=True)
cph.print_summary()
See more examples in Adding weights to observations in a Cox model.
Clusters & correlations
Another property your dataset may have is groups of related subjects. This could be caused by:
- a single individual having multiple occurrences, and hence showing up in the dataset more than once.
- subjects that share some common property, like members of the same family or being matched on propensity scores.
We call these grouped subjects "clusters", and assume they are designated by some column in the dataframe (example below). When using clusters, the point estimates of the model don't change, but the standard errors will increase. An intuitive argument for this is that 100 observations on 100 individuals provide more information than 100 observations on 10 individuals (or clusters).
import pandas as pd
from lifelines import CoxPHFitter
df = pd.DataFrame({
'T': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
'E': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
'month': [10, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
'age': [4, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
'id': [1, 1, 1, 1, 2, 3, 3, 4, 4, 5, 6, 7]
})
cph = CoxPHFitter()
cph.fit(df, 'T', 'E', cluster_col='id')
cph.print_summary()
For more examples, see Correlations between subjects in a Cox model.
Residuals
After fitting a Cox model, we can look back and compute important model residuals. These residuals can tell us about nonlinearities not captured, violations of proportional hazards, and help us answer other useful modelling questions. See Assessing Cox model fit using residuals.
Aalen's additive model
Warning
This implementation is still experimental.
Note
The API of this model changed in version 0.17.0.
Aalen's additive model is another regression model we can use. Like the Cox model, it defines the hazard rate, but instead of being multiplicative like the Cox model, the Aalen model is additive. Specifically:

\(h(t | x) = b_0(t) + b_1(t) x_1 + \cdots + b_n(t) x_n\)
Inference typically does not estimate the individual \(b_i(t)\) but instead estimates \(\int_0^t b_i(s) \; ds\) (similar to the estimate of the hazard rate using NelsonAalenFitter). This is important when interpreting the plots produced.
For this exercise, we will use the regime dataset and include the categorical variables un_continent_name (e.g., Asia, North America, ...), the regime type (e.g., monarchy, civilian, ...) and the year the regime started in, start_year. The estimator that fits the unknown coefficients in Aalen's additive model is located under lifelines.AalenAdditiveFitter.
from lifelines import AalenAdditiveFitter
from lifelines.datasets import load_dd
data = load_dd()
data.head()
ctryname  cowcode2  politycode  un_region_name  un_continent_name  ehead  leaderspellreg  democracy  regime  start_year  duration  observed 

Afghanistan  700  700  Southern Asia  Asia  Mohammad Zahir Shah  Mohammad Zahir Shah.Afghanistan.1946.1952.Monarchy  Nondemocracy  Monarchy  1946  7  1 
Afghanistan  700  700  Southern Asia  Asia  Sardar Mohammad Daoud  Sardar Mohammad Daoud.Afghanistan.1953.1962.Civilian Dict  Nondemocracy  Civilian Dict  1953  10  1 
Afghanistan  700  700  Southern Asia  Asia  Mohammad Zahir Shah  Mohammad Zahir Shah.Afghanistan.1963.1972.Monarchy  Nondemocracy  Monarchy  1963  10  1 
Afghanistan  700  700  Southern Asia  Asia  Sardar Mohammad Daoud  Sardar Mohammad Daoud.Afghanistan.1973.1977.Civilian Dict  Nondemocracy  Civilian Dict  1973  5  0 
Afghanistan  700  700  Southern Asia  Asia  Nur Mohammad Taraki  Nur Mohammad Taraki.Afghanistan.1978.1978.Civilian Dict  Nondemocracy  Civilian Dict  1978  1  0 
I'm using the lovely library patsy here to create a design matrix from my original dataframe.
import patsy
X = patsy.dmatrix('un_continent_name + regime + start_year', data, return_type='dataframe')
X = X.rename(columns={'Intercept': 'baseline'})
print(X.columns.tolist())
['baseline',
'un_continent_name[T.Americas]',
'un_continent_name[T.Asia]',
'un_continent_name[T.Europe]',
'un_continent_name[T.Oceania]',
'regime[T.Military Dict]',
'regime[T.Mixed Dem]',
'regime[T.Monarchy]',
'regime[T.Parliamentary Dem]',
'regime[T.Presidential Dem]',
'start_year']
We have also included the coef_penalizer option. During the estimation, a linear regression is computed at each step. Often the regression can be unstable (due to high collinearity or small sample sizes), so adding a penalizer term controls the stability. I recommend always starting with a small penalizer term; if the estimates still appear to be too unstable, try increasing it.
aaf = AalenAdditiveFitter(coef_penalizer=1.0, fit_intercept=False)
An instance of AalenAdditiveFitter includes a fit method that performs the inference on the coefficients. This method accepts a pandas DataFrame: each row is an individual, and the columns are the covariates plus two additional columns: a duration column and a boolean event-occurred column (where the event of interest is expulsion from government, in this case).
X['T'] = data['duration']
X['E'] = data['observed']
aaf.fit(X, 'T', event_col='E')
After fitting, the instance exposes a cumulative_hazards_
DataFrame
containing the estimates of \(\int_0^t b_i(s) \; ds\):
aaf.cumulative_hazards_.head()
baseline  un_continent_name[T.Americas]  un_continent_name[T.Asia]  un_continent_name[T.Europe]  un_continent_name[T.Oceania]  regime[T.Military Dict]  regime[T.Mixed Dem]  regime[T.Monarchy]  regime[T.Parliamentary Dem]  regime[T.Presidential Dem]  start_year 

0.03447  0.03173  0.06216  0.2058  0.009559  0.07611  0.08729  0.1362  0.04885  0.1285  0.000092 
0.14278  0.02496  0.11122  0.2083  0.079042  0.11704  0.36254  0.2293  0.17103  0.1238  0.000044 
0.30153  0.07212  0.10929  0.1614  0.063030  0.16553  0.68693  0.2738  0.33300  0.1499  0.000004 
0.37969  0.06853  0.15162  0.2609  0.185569  0.22695  0.95016  0.2961  0.37351  0.4311  0.000032 
0.36749  0.20201  0.21252  0.2429  0.188740  0.25127  1.15132  0.3926  0.54952  0.7593  0.000000 
AalenAdditiveFitter also has built-in plotting:
aaf.plot(columns=['regime[T.Presidential Dem]', 'baseline', 'un_continent_name[T.Europe]'], iloc=slice(1,15))
Regression is most interesting if we use it on data we have not yet seen, i.e., prediction! We can use what we have learned to predict individual hazard rates, survival functions, and median survival time. The dataset we are using is available up until 2008, so let's use this data to predict the duration of former Canadian Prime Minister Stephen Harper.
ix = (data['ctryname'] == 'Canada') & (data['start_year'] == 2006)
harper = X.loc[ix]
print("Harper's unique data point:")
print(harper)
Harper's unique data point:
baseline un_continent_name[T.Americas] un_continent_name[T.Asia] ... start_year T E
268 1.0 1.0 0.0 ... 2006.0 3 0
ax = plt.subplot(2,1,1)
aaf.predict_cumulative_hazard(harper).plot(ax=ax)
ax = plt.subplot(2,1,2)
aaf.predict_survival_function(harper).plot(ax=ax);
Warning
Because of the nature of the model, estimated survival functions of individuals can increase. This is an expected artifact of Aalenâ€™s additive model.
Cox's time-varying proportional hazard model
Warning
This implementation is still experimental.
Often an individual will have a covariate change over time. An example of this is hospital patients who enter the study and, at some future time, may receive a heart transplant. We would like to know the effect of the transplant, but we cannot naively condition on whether they received the transplant. Consider that if patients needed to wait at least 1 year before getting a transplant, then everyone who dies before that year is counted as a non-transplant patient, and hence this would overestimate the hazard of not receiving a transplant.
We can incorporate changes over time into our survival analysis by using a modification of the Cox model above. The general mathematical description is:

\(h(t | x(t)) = b_0(t) \exp\left(\sum_{i=1}^n \beta_i x_i(t)\right)\)
Note the time-varying \(x_i(t)\), denoting that covariates can change over time. This model is implemented in lifelines as CoxTimeVaryingFitter. The dataset schema required is different from previous models, so we will spend some time describing it.
Dataset creation for time-varying regression
lifelines requires that the dataset be in what is called the long format. This looks like one row per state change, including an ID, the left (exclusive) time point, and right (inclusive) time point. For example, the following dataset tracks three unique subjects.
id  start  stop  group  z  event 

1  0  8  1  0  False 
2  0  5  0  0  False 
2  5  8  0  1  True 
3  0  3  1  0  False 
3  3  12  1  1  True 
In the above dataset, start
and stop
denote the boundaries, id
is the unique identifier per subject, and event
denotes if the subject died at the end of that period. For example, subject ID 2 had variable z=0
up to and including the end of time period 5 (we can think of measurements as happening at the end of each period), after which it was set to 1. Since event is 1 in that row, we conclude that the subject died at time 8.
This desired dataset can be built up from smaller datasets. To do this we can use some helper functions provided in lifelines. Typically, data will be in a format that looks like it comes out of a relational database. You may have a "base" table with ids, durations alive, and a censored flag, and possibly static covariates. Ex:
id  duration  event  var1 

1  10  True  0.1 
2  12  False  0.5 
We will perform a light transform to this dataset to modify it into the â€ślongâ€ť format.
from lifelines.utils import to_long_format
base_df = to_long_format(base_df, duration_col="duration")
The new dataset looks like:
id  start  stop  var1  event 

1  0  10  0.1  True 
2  0  12  0.5  False 
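The transform itself is simple enough to sketch in plain pandas. This mimics what to_long_format produces on the example above; it is not the library's implementation:

```python
import pandas as pd

# hypothetical "base" table: one row per subject
base_df = pd.DataFrame({
    'id': [1, 2],
    'duration': [10, 12],
    'event': [True, False],
})

# add a start column of zeros and treat the duration as the stop time,
# giving one (start, stop] interval per subject
long_df = (base_df
           .assign(start=0)
           .rename(columns={'duration': 'stop'})
           [['id', 'start', 'stop', 'event']])
```

Each subject now occupies a single interval row, ready to be split further as covariates change.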
You'll also have a secondary dataset that references future measurements. This could come in two "types". The first is when you have a variable that changes over time (e.g., administering varying medication over time, or taking a temperature over time). The second type is an event-based dataset: an event happens at some time in the future (e.g., an organ transplant occurs, or an intervention). We will address this second type later. The first type of dataset may look something like:
Example:
id  time  var2 

1  0  1.4 
1  4  1.2 
1  8  1.5 
2  0  1.6 
where time
is the duration from the entry event. Here we see subject 1 had a change in their var2
covariate at the end of time 4 and at the end of time 8. We can use add_covariate_to_timeline
to fold the covariate dataset into the original dataset.
from lifelines.utils import add_covariate_to_timeline
df = add_covariate_to_timeline(base_df, cv, duration_col="time", id_col="id", event_col="event")
id  start  stop  var1  var2  event 

1  0  4  0.1  1.4  False 
1  4  8  0.1  1.2  False 
1  8  10  0.1  1.5  True 
2  0  12  0.5  1.6  False 
From the above output, we can see that subject 1 changed state twice over the observation period, finally expiring at the end of time 10. Subject 2 was a censored case, and we lost track of them after time 12.
You may have multiple covariates you wish to add, so the above could be streamlined like so:
from lifelines.utils import add_covariate_to_timeline
df = base_df.pipe(add_covariate_to_timeline, cv1, duration_col="time", id_col="id", event_col="event")\
.pipe(add_covariate_to_timeline, cv2, duration_col="time", id_col="id", event_col="event")\
.pipe(add_covariate_to_timeline, cv3, duration_col="time", id_col="id", event_col="event")
If your dataset is of the second type, that is, event-based, it may look something like the following, where values in the matrix denote times since the subject's birth, and None or NaN represent the event not happening (subjects can also be excluded if the event never occurred):
print(event_df)
id E1
0 1 1.0
1 2 NaN
2 3 3.0
...
Initially, this can't be added to our baseline dataframe. However, using utils.covariates_from_event_matrix we can convert a dataframe like this into one that can be easily added.
from lifelines.utils import covariates_from_event_matrix
cv = covariates_from_event_matrix(event_df, id_col="id")
print(cv)
event id duration E1
0 1 1.0 1
1 3 3.0 1
...
base_df = add_covariate_to_timeline(base_df, cv, duration_col="time", id_col="id", event_col="E")
For an example of pulling datasets like this from a SQL store, and other helper functions, see Example SQL queries and transformations to get time varying data.
Cumulative sums
One additional flag on add_covariate_to_timeline that is of interest is the cumulative_sum flag. By default it is False, but setting it to True will perform a cumulative sum on the covariate before joining. This is useful if the covariates describe an incremental change, instead of a state update. For example, we may have measurements of drugs administered to a patient, and we want the covariate to reflect how much has been administered since the start. Event columns can sensibly be cumulatively summed as well. In contrast, a covariate measuring the temperature of the patient is a state update, and should not be summed. See Example cumulative total using and time-varying covariates to see an example of this.
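What cumulative_sum=True effectively does before joining can be sketched with a per-subject running total in pandas (hypothetical dose data):

```python
import pandas as pd

# hypothetical incremental drug doses recorded per subject over time
cv = pd.DataFrame({
    'id':   [1, 1, 1, 2, 2],
    'time': [0, 4, 8, 0, 5],
    'dose': [2.0, 1.0, 3.0, 1.5, 1.5],
})

# replace each incremental dose with the running total per subject,
# so the covariate reflects the amount administered since the start
cv['cumsum_dose'] = cv.groupby('id')['dose'].cumsum()
```

A state-update covariate like temperature would skip this step and join the raw values instead.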
Delaying time-varying covariates
add_covariate_to_timeline also has an option for delaying, or shifting, a covariate so it changes later than originally observed. One may ask, why should one delay a time-varying covariate? Here's an example. Consider investigating the impact of smoking on mortality, where available to us are time-varying observations of how many cigarettes are consumed each month. Unbeknownst to us, when a subject reaches critical illness levels, they are admitted to the hospital and their cigarette consumption drops to zero. Some expire while in hospital. If we used this dataset naively, we would see that not smoking leads to sudden death, and conversely, that smoking improves your health! This is a case of reverse causation: the upcoming death event actually influences the covariates.
To handle this, you can delay the observations by a fixed number of time periods:
from lifelines.utils import add_covariate_to_timeline
base_df = add_covariate_to_timeline(base_df, cv, duration_col="time", id_col="id", event_col="E", delay=14)
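Conceptually, the delay shifts each observed change so it only takes effect later. A sketch of the idea in plain pandas, with hypothetical data (not the library's implementation):

```python
import pandas as pd

# hypothetical covariate-change table: monthly cigarette counts
cv = pd.DataFrame({
    'id':         [1, 1, 1],
    'time':       [0, 3, 6],
    'cigarettes': [20, 10, 0],
})

# a delay of 14 periods means each observed change only enters the
# model 14 periods later, guarding against reverse causation
delayed = cv.assign(time=cv['time'] + 14)
```

The suspicious pre-death drop in consumption then no longer coincides with the death event itself.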
Fitting the model
Once your dataset is in the correct orientation, we can use CoxTimeVaryingFitter to fit the model to your data. The method is similar to CoxPHFitter, except we need to tell fit about the additional time columns.
Fitting the Cox model to the data involves using gradient descent. lifelines takes extra effort to help with convergence, so please be attentive to any warnings that appear. Fixing any warnings will generally help convergence. For further help, see Problems with convergence in the Cox proportional hazard model.
from lifelines import CoxTimeVaryingFitter
ctv = CoxTimeVaryingFitter()
ctv.fit(df, id_col="id", event_col="event", start_col="start", stop_col="stop", show_progress=True)
ctv.print_summary()
ctv.plot()
Short note on prediction
Unlike the other regression models, prediction in a time-varying setting is not trivial. To predict, we would need to know the covariate values beyond the observed times, but if we knew those, we would also know if the subject was still alive or not! However, it is still possible to compute the hazard values of subjects at known observations, the baseline cumulative hazard rate, and the baseline survival function. So while CoxTimeVaryingFitter exposes prediction methods, there are logical limitations to what these predictions mean.
Model selection in survival regression
Model selection based on residuals
The sections Testing the Proportional Hazard Assumptions and Assessing Cox model fit using residuals may be useful for modelling your data better.
Model selection based on predictive power
If censoring is present, it's not appropriate to use a loss function like mean-squared-error or mean-absolute-loss. Instead, one measure is the concordance-index, also known as the c-index. This measure evaluates the accuracy of the ordering of predicted times. It is in fact a generalization of AUC, another common loss function, and is interpreted similarly:
- 0.5 is the expected result from random predictions,
- 1.0 is perfect concordance, and
- 0.0 is perfect anti-concordance (multiply predictions by -1 to get 1.0)
A fitted model's concordance-index is present in the print_summary() output, but also available under the score_ property. Generally, the measure is implemented in lifelines under lifelines.utils.concordance_index and accepts the actual times (along with any censorings) and the predicted times.
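To make the definition concrete, here is a simplified sketch of the c-index. It is not lifelines' implementation: ties are ignored, and only pairs where the earlier time is an observed event are compared. Here higher scores mean higher risk, i.e., predicted earlier events:

```python
def naive_concordance_index(event_times, risk_scores, event_observed):
    # a pair (i, j) is comparable when subject i's event was observed and
    # occurred before subject j's time; it is concordant when the model
    # assigned the earlier-failing subject the higher risk score
    concordant, comparable = 0, 0
    n = len(event_times)
    for i in range(n):
        for j in range(n):
            if event_observed[i] and event_times[i] < event_times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
    return concordant / comparable

# toy data: the third subject is censored, so pairs in which it would
# be the earlier "event" are not comparable
times  = [1, 2, 3, 4]
scores = [4, 3, 2, 1]
events = [1, 1, 0, 1]
```

With risk scores perfectly reversed in time, every comparable pair is concordant and the index is 1.0.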
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi
rossi = load_rossi()
cph = CoxPHFitter()
cph.fit(rossi, duration_col="week", event_col="arrest")
# method one
cph.print_summary()
# method two
print(cph.score_)
# method three
from lifelines.utils import concordance_index
concordance_index(rossi['week'], cph.predict_partial_hazard(rossi), rossi['arrest'])
However, there are other, arguably better, methods to measure the fit of a model. Included in print_summary is the log-likelihood, which can be used in an AIC calculation, and the log-likelihood ratio statistic. Personally, I loved this article by Frank Harrell, "Statistically Efficient Ways to Quantify Added Predictive Value of New Measurements".
lifelines has an implementation of k-fold cross validation under lifelines.utils.k_fold_cross_validation. This function accepts an instance of a regression fitter (either CoxPHFitter or AalenAdditiveFitter), a dataset, plus k (the number of folds to perform, default 5). On each fold, it splits the data into a training set and a testing set, fits itself on the training set, and evaluates itself on the testing set (using the concordance measure).
import numpy as np
from lifelines import CoxPHFitter
from lifelines.datasets import load_regression_dataset
from lifelines.utils import k_fold_cross_validation
regression_dataset = load_regression_dataset()
cph = CoxPHFitter()
scores = k_fold_cross_validation(cph, regression_dataset, 'T', event_col='E', k=3)
print(scores)
print(np.mean(scores))
print(np.std(scores))
#[ 0.5896 0.5358 0.5028]
# 0.542
# 0.035
[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from matplotlib import pyplot as plt
from lifelines import CoxPHFitter
import numpy as np
import pandas as pd
Testing the proportional hazard assumptions
This Jupyter notebook is a small tutorial on how to test and fix proportional hazard problems.
The proportional hazard assumption is that all individuals have the same hazard function, up to a unique scaling factor in front: \(h_i(t) = a_i h(t)\). So the shape of the hazard function is the same for all individuals, and only the scalar in front changes.
At the core of the assumption is that \(a_i\) is not time varying, that is, \(a_i(t) = a_i\). In this tutorial we will test this non-time-varying assumption, and look at ways to handle violations.
[2]:
from lifelines.datasets import load_rossi
rossi = load_rossi()
cph = CoxPHFitter()
cph.fit(rossi, 'week', 'arrest')
[2]:
<lifelines.CoxPHFitter: fitted with 432 observations, 318 censored>
[3]:
cph.print_summary(model="untransformed variables")
<lifelines.CoxPHFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
number of subjects = 432
number of events = 114
loglikelihood = 658.75
time fit was run = 20190214 13:55:57 UTC
model = untransformed variables

      coef  exp(coef)  se(coef)      z       p  -log2(p)  lower 0.95  upper 0.95
fin  -0.38       0.68      0.19  -1.98    0.05      4.40       -0.75        0.00
age  -0.06       0.94      0.02  -2.61    0.01      6.79       -0.10       -0.01
race  0.31       1.37      0.31   1.02    0.31      1.70       -0.29        0.92
wexp -0.15       0.86      0.21  -0.71    0.48      1.06       -0.57        0.27
mar  -0.43       0.65      0.38  -1.14    0.26      1.97       -1.18        0.31
paro -0.08       0.92      0.20  -0.43    0.66      0.59       -0.47        0.30
prio  0.09       1.10      0.03   3.19  <0.005      9.48        0.04        0.15

Concordance = 0.64
Likelihood ratio test = 33.27 on 7 df, -log2(p)=15.37
Checking assumptions with check_assumptions
New to lifelines 0.16.0 is the CoxPHFitter.check_assumptions method. This method computes statistics that check the proportional hazard assumption, produces plots to check assumptions, and more. Also included is an option to display advice to the console. Here's a breakdown of the information displayed:
- Presented first are the results of a statistical test for any time-varying coefficients. A time-varying coefficient implies that a covariate's influence relative to the baseline changes over time, which is a violation of the proportional hazard assumption. For each variable, we transform time four times (these are common transformations of time to perform). If lifelines rejects the null (that is, lifelines rejects that the coefficient is not time-varying), we report this to the user.
- Some advice is presented on how to correct the proportional hazard violation based on some summary statistics of the variable.
- As a complement to the statistical test above, for each variable that violates the PH assumption, visual plots of the scaled Schoenfeld residuals are presented against the four time transformations. A fitted lowess curve is also presented, along with 10 bootstrapped lowess lines (as an approximation to the confidence interval of the original lowess line). Ideally, this lowess line is constant (flat). Deviations away from the constant line are violations of the PH assumption.
Why the scaled Schoenfeld residuals?
This section can be skipped on first read. Let \(s_{t,j}\) denote the scaled Schoenfeld residuals of variable \(j\) at time \(t\), \(\hat{\beta_j}\) denote the maximum-likelihood estimate of the \(j\)th variable, and \(\beta_j(t)\) a time-varying coefficient in a (fictional) alternative model that allows for time-varying coefficients. Therneau and Grambsch showed that
\[E[s_{t,j}] + \hat{\beta_j} = \beta_j(t)\]
The proportional hazard assumption implies that \(\hat{\beta_j} = \beta_j(t)\), hence \(E[s_{t,j}] = 0\). This is what the above proportional hazard test is testing. Visually, plotting \(s_{t,j}\) over time (or some transform of time) is a good way to see violations of \(E[s_{t,j}] = 0\), along with the statistical test.
[4]:
cph.check_assumptions(rossi)
The ``p_value_threshold`` is set at 0.050. Even under the null hypothesis of no violations, some covariates
will be below the threshold (i.e. by chance). This is compounded when there are many covariates.
Similarly, when there are lots of observations, even minor deviances from the proportional hazard
assumption will be flagged.
With that in mind, it's best to use a combination of statistical tests and eyeball tests to
determine the most serious violations.
<lifelines.StatisticalResult>
test_name = proportional_hazard_test
null_distribution = chi squared
degrees_of_freedom = 1

test_statistic p log2(p)
age km 11.03 <0.005 10.12
rank 11.09 <0.005 10.17
fin km 0.02 0.89 0.17
rank 0.02 0.90 0.16
mar km 0.60 0.44 1.19
rank 0.67 0.41 1.27
paro km 0.12 0.73 0.45
rank 0.14 0.71 0.49
prio km 0.02 0.88 0.18
rank 0.02 0.88 0.18
race km 1.44 0.23 2.12
rank 1.46 0.23 2.14
wexp km 7.48 0.01 7.32
rank 7.18 0.01 7.08
1. Variable 'age' failed the non-proportional test: p-value is 0.0009.
Advice: try binning the variable 'age' using pd.cut, and then specify it in `strata=['age']` in the call in `.fit`. See documentation in link [A] and [B] below.
Advice: try adding an interaction term with your time variable. See documentation in link [A] and specifically link [C] below.
2. Variable 'wexp' failed the non-proportional test: p-value is 0.0063.
Advice: with so few unique values (only 2), you can try `strata=['wexp']` in the call in `.fit`. See documentation in link [A] and [B] below.

[A] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html
[B] https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html#checkingtheproportionalhazardsassumption
[C] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Option2:introducetimevaryingcovariates
Alternatively, you can use the proportional hazard test outside of check_assumptions:
[5]:
from lifelines.statistics import proportional_hazard_test
results = proportional_hazard_test(cph, rossi, time_transform='rank')
results.print_summary(decimals=3, model="untransformed variables")
<lifelines.StatisticalResult>
test_name = proportional_hazard_test
time_transform = rank
null_distribution = chi squared
degrees_of_freedom = 1
model = untransformed variables

test_statistic p log2(p)
age 11.094 0.001 10.173
fin 0.017 0.896 0.158
mar 0.666 0.414 1.271
paro 0.138 0.711 0.493
prio 0.023 0.881 0.183
race 1.462 0.227 2.141
wexp 7.180 0.007 7.084
In the advice above, we can see that wexp has small cardinality, so we can easily fix that by specifying it in the strata. What does the strata do? Let's go back to the proportional hazard assumption.
In the introduction, we said that the proportional hazard assumption was that
\[h_i(t) = a_i h(t)\]
In a simple case, it may be that there are two subgroups that have very different baseline hazards. That is, we can split the dataset into subsamples based on some variable (we call this the stratifying variable), run the Cox model on all subsamples, and compare their baseline hazards. If these baseline hazards are very different, then clearly the formula above is wrong: the \(h(t)\) is some weighted average of the subgroups' baseline hazards. This ill-fitting average baseline can cause \(a_i\) to have time-dependent influence. A better model might be:
\[h_i(t) = a_i h_{G_i}(t)\]
where now we have a unique baseline hazard per subgroup \(G\). Because of the way the Cox model is designed, inference of the coefficients is identical (except now there are more baseline hazards, and no variation of the stratifying variable within a subgroup \(G\)).
[6]:
cph.fit(rossi, 'week', 'arrest', strata=['wexp'])
cph.print_summary()
<lifelines.CoxPHFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
strata = ['wexp']
number of subjects = 432
number of events = 114
log-likelihood = -580.89
time fit was run = 2019-02-14 13:55:57 UTC

      coef  exp(coef)  se(coef)      z       p  -log2(p)  lower 0.95  upper 0.95
fin  -0.38       0.68      0.19  -1.99    0.05      4.42       -0.76       -0.01
age  -0.06       0.94      0.02  -2.64    0.01      6.91       -0.10       -0.01
race  0.31       1.36      0.31   1.00    0.32      1.65       -0.30        0.91
mar  -0.45       0.64      0.38  -1.19    0.23      2.09       -1.20        0.29
paro -0.08       0.92      0.20  -0.42    0.67      0.57       -0.47        0.30
prio  0.09       1.09      0.03   3.16  <0.005      9.33        0.03        0.15

Concordance = 0.64
Likelihood ratio test = 188.99 on 6 df, -log2(p)=124.17
[7]:
cph.check_assumptions(rossi)
The ``p_value_threshold`` is set at 0.050. Even under the null hypothesis of no violations, some covariates
will be below the threshold (i.e. by chance). This is compounded when there are many covariates.
Similarly, when there are lots of observations, even minor deviances from the proportional hazard
assumption will be flagged.
With that in mind, it's best to use a combination of statistical tests and eyeball tests to
determine the most serious violations.
<lifelines.StatisticalResult>
test_name = proportional_hazard_test
null_distribution = chi squared
degrees_of_freedom = 1

test_statistic p log2(p)
age km 11.29 <0.005 10.32
rank 4.62 0.03 4.99
fin km 0.02 0.90 0.16
rank 0.05 0.83 0.28
mar km 0.53 0.47 1.10
rank 1.31 0.25 1.99
paro km 0.09 0.76 0.40
rank 0.00 0.97 0.05
prio km 0.02 0.89 0.16
rank 0.02 0.90 0.16
race km 1.47 0.23 2.15
rank 0.64 0.42 1.23
1. Variable 'age' failed the non-proportional test: p-value is 0.0008.
Advice: try binning the variable 'age' using pd.cut, and then specify it in `strata=['age']` in the call in `.fit`. See documentation in link [A] and [B] below.
Advice: try adding an interaction term with your time variable. See documentation in link [A] and specifically link [C] below.

[A] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html
[B] https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html#checkingtheproportionalhazardsassumption
[C] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Option2:introducetimevaryingcovariates
Since age is still violating the proportional hazard assumption, we need to model it better. From the residual plots above, we can see the effect of age starts to become negative over time. This will be relevant later. Below, we present two options to handle age.
Option 1: bin variable and stratify on it
The first option proposed is to bin the variable into equal-sized bins, and stratify like we did with wexp. There is a trade-off here between estimation and information loss. If we have large bins, we will lose information (since different values are now binned together), but we need to estimate fewer new baseline hazards. On the other hand, with tiny bins, we give the age data the most "wiggle room", but must compute many baseline hazards, each of which has a smaller sample size.
Like most things, the optimal value is somewhere in between.
[8]:
rossi_strata_age = rossi.copy()
rossi_strata_age['age_strata'] = pd.cut(rossi_strata_age['age'], np.arange(0, 80, 3))
rossi_strata_age[['age', 'age_strata']].head()
[8]:
   age  age_strata
0   27    (24, 27]
1   18    (15, 18]
2   19    (18, 21]
3   23    (21, 24]
4   19    (18, 21]
[9]:
# drop the original, now-redundant, age column
rossi_strata_age = rossi_strata_age.drop('age', axis=1)
cph.fit(rossi_strata_age, 'week', 'arrest', strata=['age_strata', 'wexp'])
[9]:
<lifelines.CoxPHFitter: fitted with 432 observations, 318 censored>
[10]:
cph.print_summary(3, model="stratified age and wexp")
cph.plot()
<lifelines.CoxPHFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
strata = ['age_strata', 'wexp']
number of subjects = 432
number of events = 114
log-likelihood = -392.443
time fit was run = 2019-02-14 13:55:58 UTC
model = stratified age and wexp

       coef  exp(coef)  se(coef)       z      p  -log2(p)  lower 0.95  upper 0.95
fin  -0.395      0.674     0.197  -2.004  0.045     4.472      -0.781      -0.009
race  0.280      1.324     0.313   0.895  0.371     1.431      -0.334       0.895
mar  -0.194      0.824     0.392  -0.494  0.621     0.687      -0.961       0.574
paro -0.163      0.849     0.200  -0.818  0.413     1.275      -0.555       0.228
prio  0.080      1.084     0.028   2.854  0.004     7.857       0.025       0.135

Concordance = 0.609
Likelihood ratio test = 565.874 on 5 df, -log2(p)=396.379
[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f087de1c278>
[11]:
cph.check_assumptions(rossi_strata_age)
Proportional hazard assumption looks okay.
Option 2: introduce time-varying covariates
Our second option to correct variables that violate the proportional hazard assumption is to model the time-varying component directly. This is done in two steps. The first is to transform your dataset into episodic format. This means that we split a subject from a single row into \(n\) new rows, each representing some time period for the subject. It's okay that the variables are static over these new time periods; we'll introduce some time-varying covariates later.
See below for how to do this in lifelines:
[12]:
from lifelines.utils import to_episodic_format
# the time_gaps parameter specifies how large or small you want the periods to be.
rossi_long = to_episodic_format(rossi, duration_col='week', event_col='arrest', time_gaps=1.)
rossi_long.head(25)
[12]:
    stop  start  arrest  age  fin  id  mar  paro  prio  race  wexp
0    1.0    0.0       0   27    0   0    0     1     3     1     0
1    2.0    1.0       0   27    0   0    0     1     3     1     0
2    3.0    2.0       0   27    0   0    0     1     3     1     0
3    4.0    3.0       0   27    0   0    0     1     3     1     0
4    5.0    4.0       0   27    0   0    0     1     3     1     0
5    6.0    5.0       0   27    0   0    0     1     3     1     0
6    7.0    6.0       0   27    0   0    0     1     3     1     0
7    8.0    7.0       0   27    0   0    0     1     3     1     0
8    9.0    8.0       0   27    0   0    0     1     3     1     0
9   10.0    9.0       0   27    0   0    0     1     3     1     0
10  11.0   10.0       0   27    0   0    0     1     3     1     0
11  12.0   11.0       0   27    0   0    0     1     3     1     0
12  13.0   12.0       0   27    0   0    0     1     3     1     0
13  14.0   13.0       0   27    0   0    0     1     3     1     0
14  15.0   14.0       0   27    0   0    0     1     3     1     0
15  16.0   15.0       0   27    0   0    0     1     3     1     0
16  17.0   16.0       0   27    0   0    0     1     3     1     0
17  18.0   17.0       0   27    0   0    0     1     3     1     0
18  19.0   18.0       0   27    0   0    0     1     3     1     0
19  20.0   19.0       1   27    0   0    0     1     3     1     0
20   1.0    0.0       0   18    0   1    0     1     8     1     0
21   2.0    1.0       0   18    0   1    0     1     8     1     0
22   3.0    2.0       0   18    0   1    0     1     8     1     0
23   4.0    3.0       0   18    0   1    0     1     8     1     0
24   5.0    4.0       0   18    0   1    0     1     8     1     0
Each subject is given a new id (an existing id column can also be used if already present in the dataframe). This id is used to track subjects over time. Notice the arrest column is 0 for all periods prior to their (possible) event as well.
Above I mentioned there were two steps to correct age. The first was to convert to an episodic format. The second is to create an interaction term between age and stop. This is a time-varying variable.
Instead of CoxPHFitter, we must use CoxTimeVaryingFitter, since we are working with an episodic dataset.
[13]:
rossi_long['time*age'] = rossi_long['age'] * rossi_long['stop']
[14]:
from lifelines import CoxTimeVaryingFitter
ctv = CoxTimeVaryingFitter()
ctv.fit(rossi_long,
        id_col='id',
        event_col='arrest',
        start_col='start',
        stop_col='stop',
        strata=['wexp'])
[14]:
<lifelines.CoxTimeVaryingFitter: fitted with 19809 periods, 432 subjects, 114 events>
[15]:
ctv.print_summary(3)
<lifelines.CoxTimeVaryingFitter: fitted with 19809 periods, 432 subjects, 114 events>
event col = 'arrest'
strata = ['wexp']
number of subjects = 432
number of periods = 19809
number of events = 114
log-likelihood = -575.080
time fit was run = 2019-02-14 13:56:03 UTC

           coef  exp(coef)  se(coef)       z      p  -log2(p)  lower 0.95  upper 0.95
age       0.073      1.075     0.040   1.830  0.067     3.893      -0.005       0.151
fin      -0.386      0.680     0.191  -2.018  0.044     4.520      -0.760      -0.011
mar      -0.397      0.672     0.382  -1.039  0.299     1.743      -1.147       0.352
paro     -0.098      0.907     0.196  -0.501  0.616     0.698      -0.481       0.285
prio      0.090      1.094     0.029   3.152  0.002     9.267       0.034       0.146
race      0.295      1.343     0.308   0.955  0.340     1.558      -0.310       0.899
time*age -0.005      0.995     0.002  -3.337  0.001    10.203      -0.008      -0.002

Likelihood ratio test = 1150.160 on 7 df, log2(p)=0.000
[16]:
ctv.plot()
[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0844bc6978>
In the scaled Schoenfeld residual plots for age above, we can see there is a slight negative effect for higher time values. This is confirmed in the output of the CoxTimeVaryingFitter: we see that the coefficient for time*age is -0.005.
Conclusion
The point estimates and the standard errors are very close to each other using either option, so we can feel confident that either approach is fine to proceed with.
[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from matplotlib import pyplot as plt
from lifelines import CoxPHFitter
import numpy as np
import pandas as pd
from lifelines.datasets import load_rossi
plt.style.use('bmh')
Assessing Cox model fit using residuals (work in progress)
This tutorial is on some common use cases of the (many) residuals of the Cox model. We can use residuals to diagnose a model's poor fit to a dataset, and to improve an existing model's fit.
[2]:
df = load_rossi()
df['age_strata'] = pd.cut(df['age'], np.arange(0, 80, 5))
df = df.drop('age', axis=1)
cph = CoxPHFitter()
cph.fit(df, 'week', 'arrest', strata=['age_strata', 'wexp'])
[2]:
<lifelines.CoxPHFitter: fitted with 432 observations, 318 censored>
[3]:
cph.print_summary()
cph.plot();
<lifelines.CoxPHFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
strata = ['age_strata', 'wexp']
number of subjects = 432
number of events = 114
log-likelihood = -434.50
time fit was run = 2019-02-14 13:55:47 UTC

      coef  exp(coef)  se(coef)      z       p  -log2(p)  lower 0.95  upper 0.95
fin  -0.41       0.67      0.19  -2.10    0.04      4.82       -0.79       -0.03
race  0.29       1.33      0.31   0.93    0.35      1.50       -0.32        0.90
mar  -0.34       0.71      0.39  -0.87    0.38      1.38       -1.10        0.42
paro -0.10       0.91      0.20  -0.50    0.62      0.70       -0.48        0.29
prio  0.08       1.08      0.03   2.83  <0.005      7.73        0.02        0.14

Concordance = 0.61
Likelihood ratio test = 481.75 on 5 df, -log2(p)=336.05
Martingale residuals
Defined as:
\[m_i = \delta_i - \hat{H}(T_i)\]
where \(T_i\) is the total observation time of subject \(i\), \(\delta_i\) denotes whether they died under observation or not (event_observed in lifelines), and \(\hat{H}\) is the fitted cumulative hazard.
From [1]:
Martingale residuals take a value in \((-\infty, 1]\) for uncensored observations and \((-\infty, 0]\) for censored observations. Martingale residuals can be used to assess the true functional form of a particular covariate (Therneau et al. (1990)). It is often useful to overlay a LOESS curve over this plot as they can be noisy in plots with lots of observations. Martingale residuals can also be used to assess outliers in the data set whereby the survivor function predicts an event either too early or too late; however, it's often better to use the deviance residual for this.
From [2]:
Positive values mean that the patient died sooner than expected (according to the model); negative values mean that the patient lived longer than expected (or were censored).
[4]:
r = cph.compute_residuals(df, 'martingale')
r.head()
[4]:
     week  arrest  martingale
313   1.0    True    0.989383
79    5.0    True    0.972812
60    6.0    True    0.947727
225   7.0    True    0.976976
138   8.0    True    0.920273
[5]:
r.plot.scatter(
x='week', y='martingale', c=np.where(r['arrest'], '#008fd5', '#fc4f30'),
alpha=0.75
)
[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7dc66b3be0>
Deviance residuals
One problem with martingale residuals is that they are not symmetric around 0. Deviance residuals are a transform of martingale residuals that makes them symmetric. Deviance residuals are:
- Roughly symmetric around zero, with approximate standard deviation equal to 1.
- Positive values mean that the patient died sooner than expected.
- Negative values mean that the patient lived longer than expected (or was censored).
- Very large or small values are likely outliers.
[6]:
r = cph.compute_residuals(df, 'deviance')
r.head()
[6]:
     week  arrest  deviance
313   1.0    True    2.666807
79    5.0    True    2.294411
60    6.0    True    2.001769
225   7.0    True    2.363998
138   8.0    True    1.793808
[7]:
r.plot.scatter(
x='week', y='deviance', c=np.where(r['arrest'], '#008fd5', '#fc4f30'),
alpha=0.75
)
[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7dc6642160>
[8]:
r = r.join(df.drop(['week', 'arrest'], axis=1))
[9]:
plt.scatter(r['prio'], r['deviance'], color=np.where(r['arrest'], '#008fd5', '#fc4f30'))
[9]:
<matplotlib.collections.PathCollection at 0x7f7dc65ba6a0>
[10]:
r = cph.compute_residuals(df, 'delta_beta')
r.head()
r = r.join(df[['week', 'arrest']])
r.head()
[10]:
          fin      race       mar      paro      prio  week  arrest
313  0.005650  0.011593  0.012142  0.027450  0.020486     1       1
79   0.005761  0.005810  0.007687  0.020926  0.013372     5       1
60   0.005783  0.000146  0.003277  0.014325  0.006315     6       1
225  0.014998  0.041568  0.004855  0.002254  0.015725     7       1
138  0.011572  0.005331  0.004241  0.013036  0.004405     8       1
[11]:
plt.scatter(r['week'], r['prio'], color=np.where(r['arrest'], '#008fd5', '#fc4f30'))
[11]:
<matplotlib.collections.PathCollection at 0x7f7dc65225c0>
[2] http://myweb.uiowa.edu/pbreheny/7210/f15/notes/1110.pdf
Piecewise Exponential models and creating custom models
This section will be easier if we recall our three mathematical "creatures" and the relationships between them. First is the survival function, \(S(t)\), that represents the probability of living past some time, \(t\). Next is the always non-negative and non-decreasing cumulative hazard function, \(H(t)\). Its relation to \(S(t)\) is:
\[H(t) = -\log S(t)\]
Finally, the hazard function, \(h(t)\), is the derivative of the cumulative hazard:
\[h(t) = \frac{dH(t)}{dt}\]
which has the immediate relation to the survival function:
\[S(t) = \exp\left(-\int_0^t h(s)\, ds\right)\]
Notice that any of the three absolutely defines the other two. Some situations make it easier to define one vs the others. For example, in the Cox model, it's easiest to work with the hazard, \(h(t)\). In this section on parametric univariate models, it'll be easiest to work with the cumulative hazard. This is because of an asymmetry in math: derivatives are much easier to compute than integrals. So, if we define the cumulative hazard, both the hazard and survival function are much easier to reason about, versus if we define the hazard and ask questions about the other two. At first, it may be easier to think about the hazard, and that's fine, but so long as we are clever enough to also determine the cumulative hazard, then we can ride the computational train. This will be clear by the end of the tutorial.
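These relationships are easy to check numerically. A small sketch using the constant-hazard (exponential) model, where \(H(t) = \lambda t\); the rate 0.019 is only an illustrative value:

```python
import numpy as np

lam = 0.019
t = np.linspace(0.01, 80, 1000)

H = lam * t            # define the cumulative hazard directly
S = np.exp(-H)         # survival function: S(t) = exp(-H(t))
h = np.gradient(H, t)  # hazard: the derivative of H(t)

# the exponential hazard is constant, and H(t) = -log S(t) holds everywhere
assert np.allclose(h, lam)
assert np.allclose(H, -np.log(S))
```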
First, let's revisit some simpler parametric models.
The Exponential model
Recall that the Exponential model has a constant hazard, that is:
\[h(t) = \lambda\]
which implies that the cumulative hazard, \(H(t)\), has a pretty simple form: \(H(t) = \lambda t\). Below we fit this model to some survival data.
[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from lifelines.datasets import load_waltons
waltons = load_waltons()
T, E = waltons['T'], waltons['E']
[2]:
from lifelines import ExponentialFitter
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
epf = ExponentialFitter().fit(T, E)
epf.plot_hazard(ax=ax[0])
epf.plot_cumulative_hazard(ax=ax[1])
ax[0].set_title("hazard"); ax[1].set_title("cumulative_hazard")
epf.print_summary(3)
<lifelines.ExponentialFitter: fitted with 163 observations, 7 censored>
number of subjects = 163
number of events = 156
log-likelihood = -771.913
hypothesis = lambda_ != 1

          coef  se(coef)  lower 0.95  upper 0.95        p  -log2(p)
lambda_  0.019     0.002       0.016       0.022  <0.0005       inf
This model does a poor job of fitting our data. If I fit a non-parametric model, like the Nelson-Aalen model, to this data, the lack of fit is very obvious.
[3]:
from lifelines import NelsonAalenFitter
ax = epf.plot(figsize=(8,5))
naf = NelsonAalenFitter().fit(T, E)
ax = naf.plot(ax=ax)
plt.legend()
[3]:
<matplotlib.legend.Legend at 0x11a1f8908>
It should be clear that the single parameter model is just averaging the hazards over the entire time period. In reality though, the true hazard rate exhibits some complex nonlinear behaviour.
Piecewise Exponential models
What if we could break our model into different time periods, and fit an exponential model to each of those? For example, we define the hazard as:
\[h(t) = \begin{cases} \lambda_0 & \text{if } t \le \tau_0 \\ \lambda_1 & \text{if } \tau_0 < t \le \tau_1 \\ \lambda_2 & \text{if } \tau_1 < t \end{cases}\]
This model should be flexible enough to fit our dataset better.
The cumulative hazard is only slightly more complicated, but still simple enough to define in Python. In lifelines, univariate models are constructed such that one only needs to define the cumulative hazard model with the parameters of interest, and all the hard work of fitting, creating confidence intervals, plotting, etc. is taken care of.
For example, lifelines has implemented the PiecewiseExponentialFitter model. Internally, the code is a single function that defines the cumulative hazard. The user specifies where they believe the "breaks" are, and lifelines estimates the best \(\lambda_i\).
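To make the "breaks" idea concrete, here is the piecewise cumulative hazard written out as a plain function (a sketch of the mathematics, not PiecewiseExponentialFitter's actual code): each interval contributes its rate times the amount of time spent in it.

```python
import numpy as np

def piecewise_cumulative_hazard(lambdas, times, breakpoints=(40, 60)):
    # H(t) integrates the piecewise-constant hazard: interval i
    # contributes lambda_i * (time spent inside interval i)
    times = np.atleast_1d(times).astype(float)
    edges = np.concatenate([[0.0], np.asarray(breakpoints, dtype=float), [np.inf]])
    H = np.zeros_like(times)
    for lam, lo, hi in zip(lambdas, edges[:-1], edges[1:]):
        H += lam * np.clip(times - lo, 0.0, hi - lo)
    return H

# e.g. with the three rates fit below, H(50) = 0.006 * 40 + 0.032 * 10
H50 = piecewise_cumulative_hazard([0.006, 0.032, 0.213], 50)[0]
```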
[4]:
from lifelines import PiecewiseExponentialFitter
# looking at the above plot, I think there may be breaks at t=40 and t=60.
pf = PiecewiseExponentialFitter(breakpoints=[40, 60]).fit(T, E)
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
ax = pf.plot(ax=axs[1])
pf.plot_hazard(ax=axs[0])
ax = naf.plot(ax=ax, ci_show=False)
axs[0].set_title("hazard"); axs[1].set_title("cumulative_hazard")
pf.print_summary(3)
<lifelines.PiecewiseExponentialFitter: fitted with 163 observations, 7 censored>
number of subjects = 163
number of events = 156
log-likelihood = -3.970
hypothesis = lambda_0_ != 1, lambda_1_ != 1, lambda_2_ != 1

            coef  se(coef)  lower 0.95  upper 0.95        p  -log2(p)
lambda_0_  0.006     0.001       0.004       0.008  <0.0005       inf
lambda_1_  0.032     0.004       0.024       0.040  <0.0005       inf
lambda_2_  0.213     0.028       0.158       0.269  <0.0005   556.834
We can see a much better fit in this model. A quantitative measure of the better fit is to compare the log-likelihood of the exponential model and the piecewise exponential model (higher is better): it went from -771.913 to -3.970, respectively. We could keep going and add more and more breakpoints, but that would end up overfitting to the data.
Univariate models in lifelines
I mentioned that the PiecewiseExponentialFitter was implemented using only its cumulative hazard function. This is not a lie. lifelines has very general semantics for univariate fitters. For example, this is how the entire ExponentialFitter is implemented:
class ExponentialFitter(ParametericUnivariateFitter):

    _fitted_parameter_names = ["lambda_"]

    def _cumulative_hazard(self, params, times):
        lambda_ = params[0]
        return lambda_ * times
We only need to specify the cumulative hazard function because of the one-to-one relationships between the cumulative hazard function, the survival function, and the hazard rate. From there, lifelines handles the rest.
Defining our own survival models
To show off the flexibility of lifelines univariate models, we'll create a brand new, never-before-seen survival model. Looking at the Nelson-Aalen fit, the cumulative hazard looks like there might be an asymptote at \(t=80\). This may correspond to an absolute upper limit of subjects' lives. Let's start with that functional form:
\[H_1(t) = \frac{\alpha}{80 - t}\]
We subscript \(1\) because we'll investigate other models. In a lifelines univariate model, this is defined in the following code.
Important: in order to compute derivatives, you must use the numpy imported from the autograd library. This is a thin wrapper around the original numpy. See the import below.
[5]:
from lifelines.fitters import ParametericUnivariateFitter
import autograd.numpy as np

class InverseTimeHazardFitter(ParametericUnivariateFitter):

    # we tell the model what we want the names of the unknown parameters to be
    _fitted_parameter_names = ['alpha_']

    # this is the only function we need to define. It always takes two arguments:
    #   params: an iterable that unpacks the parameters you'll need, in the order of _fitted_parameter_names
    #   times: a vector of times that will be passed in
    def _cumulative_hazard(self, params, times):
        alpha = params[0]
        return alpha / (80 - times)
[6]:
itf = InverseTimeHazardFitter()
itf.fit(T, E)
itf.print_summary()
ax = itf.plot(figsize=(8,5))
ax = naf.plot(ax=ax, ci_show=False)
plt.legend()
<lifelines.InverseTimeHazardFitter: fitted with 163 observations, 7 censored>
number of subjects = 163
number of events = 156
log-likelihood = -4.281
hypothesis = alpha_ != 1

         coef  se(coef)  lower 0.95  upper 0.95       p  -log2(p)
alpha_  21.51      1.72       18.13       24.88  <0.005    106.22
[6]:
<matplotlib.legend.Legend at 0x11a58fb38>
The best fit of the model to the data is:
\[H_1(t) = \frac{21.51}{80 - t}\]
Our choice of 80 as an asymptote was maybe mistaken, so let's allow the asymptote to be another parameter:
\[H_2(t) = \frac{\alpha}{\beta - t}\]
If we define the model this way, we need to add a bound on the values that \(\beta\) can take: obviously it can't be smaller than or equal to the maximum observed duration. Generally, the cumulative hazard must be positive and non-decreasing. Otherwise the model fit will hit convergence problems.
[7]:
class TwoParamInverseTimeHazardFitter(ParametericUnivariateFitter):

    _fitted_parameter_names = ['alpha_', 'beta_']

    # Sequence of (min, max) pairs for each parameter. None is used to specify no bound
    _bounds = [(0, None), (75.0001, None)]

    def _cumulative_hazard(self, params, times):
        a, b = params
        return a / (b - times)
[8]:
two_f = TwoParamInverseTimeHazardFitter()
two_f.fit(T, E)
two_f.print_summary()
ax = itf.plot(ci_show=False, figsize=(8,5))
ax = naf.plot(ax=ax, ci_show=False)
two_f.plot(ax=ax)
plt.legend()
<lifelines.TwoParamInverseTimeHazardFitter: fitted with 163 observations, 7 censored>
number of subjects = 163
number of events = 156
log-likelihood = -4.206
hypothesis = alpha_ != 1, beta_ != 76

         coef  se(coef)  lower 0.95  upper 0.95       p  -log2(p)
alpha_  16.50      1.51       13.55       19.46  <0.005     79.98
beta_   76.55      0.38       75.80       77.30    0.15      2.73
[8]:
<matplotlib.legend.Legend at 0x11a859b38>
From the output, we see that 76.55 is the suggested asymptote, that is:
\[H_2(t) = \frac{16.50}{76.55 - t}\]
The curve also appears to track the Nelson-Aalen estimate better. Let's try one additional parameter, \(\gamma\), some sort of measure of decay:
\[H_3(t) = \frac{\alpha}{(\beta - t)^\gamma}\]
[9]:
from lifelines.fitters import ParametericUnivariateFitter

class ThreeParamInverseTimeHazardFitter(ParametericUnivariateFitter):

    _fitted_parameter_names = ['alpha_', 'beta_', 'gamma_']

    _bounds = [(0, None), (75.0001, None), (0, None)]

    # this is the only function we need to define. It always takes two arguments:
    #   params: an iterable that unpacks the parameters you'll need, in the order of _fitted_parameter_names
    #   times: a numpy vector of times that will be passed in by the optimizer
    def _cumulative_hazard(self, params, times):
        a, b, c = params
        return a / (b - times) ** c
[10]:
three_f = ThreeParamInverseTimeHazardFitter()
three_f.fit(T, E)
three_f.print_summary()
ax = itf.plot(ci_show=False, figsize=(8,5))
ax = naf.plot(ax=ax, ci_show=False)
ax = two_f.plot(ax=ax, ci_show=False)
ax = three_f.plot(ax=ax)
plt.legend()
<lifelines.ThreeParamInverseTimeHazardFitter: fitted with 163 observations, 7 censored>
number of subjects = 163
number of events = 156
log-likelihood = -3.984
hypothesis = alpha_ != 1, beta_ != 76, gamma_ != 1

              coef    se(coef)   lower 0.95  upper 0.95       p  -log2(p)
alpha_  1588776.28  3775137.44  -5810357.13  8987909.70    0.67      0.57
beta_       100.88        5.88        89.35      112.41  <0.005     15.38
gamma_        3.83        0.50         2.85        4.81  <0.005     25.82
[10]:
<matplotlib.legend.Legend at 0x11ae3ba20>
Our new asymptote is at \(t\approx 100, \text{c.i.}=(87, 112)\). The model appears to fit the early times better than the previous models as well; however, our \(\alpha\) parameter has more uncertainty now. Continuing to add parameters isn't advisable, as we would overfit the data.
Why fit parametric models anyways?
Taking a step back, we are fitting parametric models and comparing them to the non-parametric Nelson-Aalen. Why not just always use the Nelson-Aalen model?
- Sometimes we have scientific motivations to use a parametric model. That is, using domain knowledge, we may know the system has a parametric form and we wish to fit to that model.
- In a parametric model, we are borrowing information from all observations to determine the best parameters. To make this more clear, imagine taking a single observation and changing its value wildly. The fitted parameters would change as well. On the other hand, imagine doing the same for a non-parametric model. In that case, only the local survival function or hazard function would change. Because parametric models can borrow information from all observations, and there are far fewer unknowns than in a non-parametric model, parametric models are said to be more statistically efficient.
- Extrapolation: non-parametric models are not easily extended to values outside the observed data. On the other hand, parametric models have no problem with this. However, extrapolation outside observed values is a very dangerous activity.
[11]:
fig, axs = plt.subplots(3, figsize=(7, 8), sharex=True)
new_timeline = np.arange(0, 85)
three_f = ThreeParamInverseTimeHazardFitter().fit(T, E, timeline=new_timeline)
three_f.plot_hazard(label='hazard', ax=axs[0]).legend()
three_f.plot_cumulative_hazard(label='cumulative hazard', ax=axs[1]).legend()
three_f.plot_survival_function(label='survival function', ax=axs[2]).legend()
fig.subplots_adjust(hspace=0)
# Hide x labels and tick labels for all but bottom plot.
for ax in axs:
ax.label_outer()
Looking for more examples of what you can build?
See other unique survival models in the docs on time-lagged survival.
Time-lagged conversion rates and cure models
Suppose in our population we have a subpopulation that will never experience the event of interest. Or, for some subjects, the event will occur so far in the future that it's essentially at time infinity. The survival function should not asymptotically approach zero, but some positive value. Models that describe this are sometimes called cure models or time-lagged conversion models.
There's a serious fault in using parametric models for these types of problems that non-parametric models don't have. The most common parametric models, like Weibull, Log-Normal, etc., all have strictly increasing cumulative hazard functions, which means the corresponding survival function will always converge to 0.
Letâ€™s look at an example of this problem. Below I generated some data that has individuals who will not experience the event, not matter how long we wait.
[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from matplotlib import pyplot as plt
import autograd.numpy as np
from autograd.scipy.special import expit, logit
import pandas as pd
plt.style.use('bmh')
[2]:
N = 200
U = np.random.rand(N)
T = -(logit(-np.log(U) / 0.5) - np.random.exponential(2, N) - 6.00) / 0.50
E = ~np.isnan(T)
T[np.isnan(T)] = 50
[3]:
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter().fit(T, E)
kmf.plot(figsize=(8,4))
plt.ylim(0, 1);
It should be clear that there is an asymptote at around 0.6. The nonparametric model will always show this. If this is true, then the cumulative hazard function should have a horizontal asymptote as well. Let's use the Nelson-Aalen model to see this.
[4]:
from lifelines import NelsonAalenFitter
naf = NelsonAalenFitter().fit(T, E)
naf.plot(figsize=(8,4))
However, when we try a parametric model, we will see that it won't extrapolate very well. Let's use the flexible two-parameter LogLogisticFitter model.
[13]:
from lifelines import LogLogisticFitter
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 6))
t = np.linspace(0, 40)
llf = LogLogisticFitter().fit(T, E, timeline=t)
llf.plot_survival_function(ax=ax[0][0])
kmf.plot(ax=ax[0][0])
llf.plot_cumulative_hazard(ax=ax[0][1])
naf.plot(ax=ax[0][1])
t = np.linspace(0, 100)
llf = LogLogisticFitter().fit(T, E, timeline=t)
llf.plot_survival_function(ax=ax[1][0])
kmf.plot(ax=ax[1][0])
llf.plot_cumulative_hazard(ax=ax[1][1])
naf.plot(ax=ax[1][1])
The log-logistic model does quite a poor job of capturing not only the asymptotes, but the intermediate values as well. If we extended the survival function out further, we would see it eventually nearing 0.
Custom parametric models to handle asymptotes¶
Focusing on modeling the cumulative hazard function, what we would like is a function that increases up to a limit and then tapers off to an asymptote. We can think long and hard about such functions (I did), but there's a family of functions with this property that we are very familiar with: cumulative distribution functions! By their nature, they asymptotically approach 1. And they are readily available in the SciPy and autograd libraries. So our new model of the cumulative hazard function is:
\[H(t; c, \theta) = c F(t; \theta)\]
where \(c\) is the (unknown) horizontal asymptote, and \(\theta\) is a vector of (unknown) parameters for the CDF, \(F\).
We can create a custom cumulative hazard model by subclassing ParametricUnivariateFitter. Let's choose the normal CDF for our model.
[6]:
from autograd.scipy.stats import norm
from lifelines.fitters import ParametricUnivariateFitter

class UpperAsymptoteFitter(ParametricUnivariateFitter):

    _fitted_parameter_names = ["c_", "mu_", "sigma_"]
    _bounds = ((0, None), (None, None), (0, None))

    def _cumulative_hazard(self, params, times):
        c, mu, sigma = params
        return c * norm.cdf((times - mu) / sigma, loc=0, scale=1)
[7]:
uaf = UpperAsymptoteFitter().fit(T, E)
uaf.print_summary(3)
uaf.plot(figsize=(8,4))
<lifelines.UpperAsymptoteFitter: fitted with 200 observations, 110 censored>
number of subjects = 200
number of events = 90
log-likelihood = -2.080
hypothesis = c_ != 1, mu_ != 0, sigma_ != 1

         coef  se(coef)  lower 0.95  upper 0.95        p  -log2(p)
c_      0.604     0.065       0.477       0.730  <0.0005    30.173
mu_    17.747     0.577      16.617      18.878  <0.0005   688.573
sigma_  5.387     0.403       4.598       6.176  <0.0005    89.412
We get a lovely asymptotic cumulative hazard. The summary table suggests that the best value of \(c\) is 0.586. This can be translated into the survival function asymptote by \(\exp(-0.586) \approx 0.56\).
Letâ€™s compare this fit to the nonparametric models.
[12]:
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 6))
t = np.linspace(0, 40)
uaf = UpperAsymptoteFitter().fit(T, E, timeline=t)
uaf.plot_survival_function(ax=ax[0][0])
kmf.plot(ax=ax[0][0])
uaf.plot_cumulative_hazard(ax=ax[0][1])
naf.plot(ax=ax[0][1])
t = np.linspace(0, 100)
uaf = UpperAsymptoteFitter().fit(T, E, timeline=t)
uaf.plot_survival_function(ax=ax[1][0])
kmf.survival_function_.plot(ax=ax[1][0])
uaf.plot_cumulative_hazard(ax=ax[1][1])
naf.plot(ax=ax[1][1])
I wasn't expecting this good of a fit. But there it is. This was some artificial data, but let's try this technique on some real-life data.
[38]:
from lifelines.datasets import load_leukemia, load_kidney_transplant
T, E = load_leukemia()['t'], load_leukemia()['status']
uaf.fit(T, E)
ax = uaf.plot_survival_function(figsize=(8,4))
uaf.print_summary()
kmf.fit(T, E).plot(ax=ax, ci_show=False)
print("")
print("Estimated lower bound: {:.2f} ({:.2f}, {:.2f})".format(
    np.exp(-uaf.summary.loc['c_', 'coef']),
    np.exp(-uaf.summary.loc['c_', 'upper 0.95']),
    np.exp(-uaf.summary.loc['c_', 'lower 0.95']),
    )
)
<lifelines.UpperAsymptoteFitter: fitted with 42 observations, 12 censored>
number of subjects = 42
number of events = 30
log-likelihood = -2.824
hypothesis = c_ != 1, mu_ != 0, sigma_ != 1

         coef  se(coef)  lower 0.95  upper 0.95       p  -log2(p)
c_       1.63      0.36        0.94        2.33    0.07      3.75
mu_     13.44      1.73       10.06       16.82  <0.005     47.07
sigma_   7.03      1.07        4.94        9.12  <0.005     25.91

Estimated lower bound: 0.20 (0.10, 0.39)
So we might expect that about 20% will not have the event in this population (but note the large CI bounds too!).
[37]:
# Another, less obvious, dataset.
T, E = load_kidney_transplant()['time'], load_kidney_transplant()['death']
uaf.fit(T, E)
ax = uaf.plot_survival_function(figsize=(8,4))
uaf.print_summary()
kmf.fit(T, E).plot(ax=ax)
print("")
print("Estimated lower bound: {:.2f} ({:.2f}, {:.2f})".format(
    np.exp(-uaf.summary.loc['c_', 'coef']),
    np.exp(-uaf.summary.loc['c_', 'upper 0.95']),
    np.exp(-uaf.summary.loc['c_', 'lower 0.95']),
    )
)
<lifelines.UpperAsymptoteFitter: fitted with 863 observations, 723 censored>
number of subjects = 863
number of events = 140
log-likelihood = -1.690
hypothesis = c_ != 1, mu_ != 0, sigma_ != 1

          coef  se(coef)  lower 0.95  upper 0.95       p  -log2(p)
c_        0.29      0.03        0.24        0.35  <0.005    433.79
mu_    1139.65    101.52      940.68     1338.62  <0.005     94.73
sigma_  872.25     66.23      742.43     1002.06  <0.005    128.87

Estimated lower bound: 0.75 (0.70, 0.79)
More examples and recipes¶
This section goes through some examples and recipes to help you use lifelines.
Statistically compare two populations¶
Often researchers want to compare survival between different populations. Here are some techniques to do that:
Subtraction and division between survival curves¶
If you are interested in taking the difference between two survival curves, simply trying to subtract the survival_function_ attributes will likely fail if the DataFrames' indexes are not equal. Fortunately, the KaplanMeierFitter and NelsonAalenFitter have a built-in subtract method:
kmf1.subtract(kmf2)
will produce the difference at every relevant time point. A similar method exists for division: divide. However, for rigorous testing of differences, lifelines comes with a statistics library. See below.
Logrank test¶
Note
The logrank test has maximum power when the assumption of proportional hazards is true. As a consequence, if the survival curves cross, the logrank test will give an inaccurate assessment of differences.
The lifelines.statistics.logrank_test function compares whether the "death" generation processes of the two populations are equal:
from lifelines.statistics import logrank_test
results = logrank_test(T1, T2, event_observed_A=E1, event_observed_B=E2)
results.print_summary()
"""
t_0 = 1
alpha = 0.95
null_distribution = chi squared
df = 1
use_bonferroni = True

test_statistic p
3.528 0.00034 **
"""
print(results.p_value) # 0.46759
print(results.test_statistic) # 0.528
If you have more than two populations, you can use pairwise_logrank_test (which compares each pair in the same manner as above), or multivariate_logrank_test (which tests the hypothesis that all the populations have the same "death" generation process).
from lifelines.statistics import multivariate_logrank_test

df = pd.DataFrame({
    'durations': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
    'groups': [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2],  # could be strings too
    'events': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
})
results = multivariate_logrank_test(df['durations'], df['groups'], df['events'])
results.print_summary()
"""
t_0 = 1
alpha = 0.95
null_distribution = chi squared
df = 2

test_statistic p
1.0800 0.5827

"""
Survival differences at a point in time¶
Often analysts want to compare the survival of groups at specific times, rather than comparing the entire survival curves against each other. For example, analysts may be interested in 5-year survival. Statistically comparing the naive Kaplan-Meier points at a specific time actually has reduced power. By transforming the Kaplan-Meier curve, we can recover more power. The function statistics.survival_difference_at_fixed_point_in_time_test uses the log(-log) transformation implicitly and compares the survival of populations at a specific point in time.
from lifelines.statistics import survival_difference_at_fixed_point_in_time_test
results = survival_difference_at_fixed_point_in_time_test(point_in_time, T1, T2, event_observed_A=E1, event_observed_B=E2)
results.print_summary()
Model selection using lifelines¶
If using lifelines for prediction work, it's ideal that you perform some type of cross-validation scheme. Cross-validation allows you to be confident that your out-of-sample predictions will work well in practice. It also allows you to choose between multiple models.
lifelines has a built-in k-fold cross-validation function. For example, consider the following:
import numpy as np
from lifelines import AalenAdditiveFitter, CoxPHFitter
from lifelines.datasets import load_regression_dataset
from lifelines.utils import k_fold_cross_validation

df = load_regression_dataset()

# create the three models we'd like to compare.
aaf_1 = AalenAdditiveFitter(coef_penalizer=0.5)
aaf_2 = AalenAdditiveFitter(coef_penalizer=10)
cph = CoxPHFitter()

print(np.mean(k_fold_cross_validation(cph, df, duration_col='T', event_col='E')))
print(np.mean(k_fold_cross_validation(aaf_1, df, duration_col='T', event_col='E')))
print(np.mean(k_fold_cross_validation(aaf_2, df, duration_col='T', event_col='E')))
From these results, Aalen's additive model with a penalizer of 10 is the best model for predicting future survival times.
Plotting multiple figures on a plot¶
When .plot is called, an axis object is returned which can be passed into future calls of .plot:
kmf.fit(data1)
ax = kmf.plot()
kmf.fit(data2)
ax = kmf.plot(ax=ax)
If you have a pandas DataFrame with columns â€śgroupâ€ť, â€śTâ€ť, and â€śEâ€ť, then something like the following would work:
from lifelines import KaplanMeierFitter
from matplotlib import pyplot as plt
ax = plt.subplot(111)
kmf = KaplanMeierFitter()
for name, grouped_df in df.groupby('group'):
    kmf.fit(grouped_df["T"], grouped_df["E"], label=name)
    kmf.plot(ax=ax)
Plotting options and styles¶
Let's load some data:
from lifelines.datasets import load_waltons
waltons = load_waltons()
T = waltons['T']
E = waltons['E']
Standard¶
kmf = KaplanMeierFitter()
kmf.fit(T, E, label="kmf.plot()")
kmf.plot()
Show censors and edit markers¶
kmf.fit(T, E, label="kmf.plot(show_censors=True, \ncensor_styles={'ms': 6, 'marker': 's'})")
kmf.plot(show_censors=True, censor_styles={'ms': 6, 'marker': 's'})
Hide confidence intervals¶
kmf.fit(T, E, label="kmf.plot(ci_show=False)")
kmf.plot(ci_show=False)
Invert axis¶
kmf.fit(T, E, label="kmf.plot(invert_y_axis=True)")
kmf.plot(invert_y_axis=True)
Displaying at-risk counts below plots¶
kmf.fit(T, E, label="label name")
kmf.plot(at_risk_counts=True)
Displaying multiple at-risk counts below plots¶
The function add_at_risk_counts in lifelines.plotting allows you to add at-risk counts at the bottom of your figures. For example:
from lifelines import KaplanMeierFitter
ix = waltons['group'] == 'control'
ax = plt.subplot(111)
kmf_control = KaplanMeierFitter()
ax = kmf_control.fit(waltons.loc[ix]['T'], waltons.loc[ix]['E'], label='control').plot(ax=ax)
kmf_exp = KaplanMeierFitter()
ax = kmf_exp.fit(waltons.loc[~ix]['T'], waltons.loc[~ix]['E'], label='exp').plot(ax=ax)
from lifelines.plotting import add_at_risk_counts
add_at_risk_counts(kmf_exp, kmf_control, ax=ax)
will display the at-risk counts below the plot.
Transforming survival-table data into lifelines format¶
Some lifelines classes are designed for lists or arrays that represent one individual per row. If you instead have data in a survival table format, there exists a utility method to get it into lifelines format.
Example: Suppose you have a csv file with data that looks like this:

time | observed deaths | censored
-----|-----------------|---------
0    | 7               | 0
1    | 1               | 1
2    | 2               | 0
3    | 1               | 2
4    | 5               | 2
...  | ...             | ...
import pandas as pd
from lifelines.utils import survival_events_from_table

df = pd.read_csv('file.csv', usecols=['time', 'observed deaths', 'censored'])
df = df.set_index('time')

T, E = survival_events_from_table(df, observed_deaths_col='observed deaths', censored_col='censored')
print(T) # array([0,0,0,0,0,0,0,1,...])
print(E) # array([1,1,1,1,1,1,1,0,...])
Transforming observational data into survival-table format¶
Perhaps you are interested in viewing the survival table given some durations and censoring vectors.
from lifelines.utils import survival_table_from_events
table = survival_table_from_events(T, E)
print(table.head())
"""
removed observed censored entrance at_risk
event_at
0 0 0 0 60 60
2 2 1 1 0 60
3 3 1 2 0 58
4 5 3 2 0 55
5 12 6 6 0 50
"""
Set the index/timeline of an estimate¶
Suppose your dataset has lifetimes grouped near time 60, so that after fitting KaplanMeierFitter, your survival function might look something like:
print(kmf.survival_function_)
KMestimate
0 1.00
47 0.99
49 0.97
50 0.96
51 0.95
52 0.91
53 0.86
54 0.84
55 0.79
56 0.74
57 0.71
58 0.67
59 0.58
60 0.49
61 0.41
62 0.31
63 0.24
64 0.19
65 0.14
66 0.10
68 0.07
69 0.04
70 0.02
71 0.01
74 0.00
What you would like is to have a predictable and full index from 40 to 75. (Notice that in the above index, the last two time points are not adjacent: the cause is observing no lifetimes existing for times 72 or 73.) This is especially useful for comparing multiple survival functions at specific time points. To do this, all fitter methods accept a timeline argument:
kmf.fit(T, timeline=range(40,75))
print(kmf.survival_function_)
KMestimate
40 1.00
41 1.00
42 1.00
43 1.00
44 1.00
45 1.00
46 1.00
47 0.99
48 0.99
49 0.97
50 0.96
51 0.95
52 0.91
53 0.86
54 0.84
55 0.79
56 0.74
57 0.71
58 0.67
59 0.58
60 0.49
61 0.41
62 0.31
63 0.24
64 0.19
65 0.14
66 0.10
67 0.10
68 0.07
69 0.04
70 0.02
71 0.01
72 0.01
73 0.01
74 0.00
lifelines will intelligently forwardfill the estimates to unseen time points.
Example SQL query to get survival data from a table¶
Below is a way to get an example dataset from a relational database (this may vary depending on your database):
SELECT
id,
DATEDIFF('dd', started_at, COALESCE(ended_at, CURRENT_DATE)) AS "T",
(ended_at IS NOT NULL) AS "E"
FROM table
Explanation¶
Each row is an id, a duration, and a boolean indicating whether the event occurred or not. Recall that we denote a "True" if the event did occur, that is, ended_at is filled in (we observed the ended_at). Ex:
id | T  | E
---|----|------
10 | 40 | True
11 | 42 | False
12 | 42 | False
13 | 36 | True
14 | 33 | True
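A rough pandas equivalent of the query above (the started_at/ended_at column names are assumptions carried over from the SQL, and "today" stands in for CURRENT_DATE):

```python
import pandas as pd

# toy stand-in for the SQL table: one row per subject
df = pd.DataFrame({
    'id': [10, 11, 12],
    'started_at': pd.to_datetime(['2020-01-01', '2020-01-05', '2020-01-10']),
    'ended_at': pd.to_datetime(['2020-02-10', None, '2020-02-21']),
})

today = pd.Timestamp('2020-03-01')

# E: True when ended_at is filled in, i.e. the event was observed
df['E'] = df['ended_at'].notna()

# T: days from start until the event, or until "today" for censored subjects
df['T'] = (df['ended_at'].fillna(today) - df['started_at']).dt.days

print(df[['id', 'T', 'E']])
```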
Example SQL queries and transformations to get time-varying data¶
For Cox time-varying models, we discussed what the dataset should look like in Dataset creation for time-varying regression. Typically we have a base dataset, and then we fold in the covariate datasets. Below are some SQL queries and Python transformations, end-to-end.
Base dataset: base_df¶
SELECT
id,
group,
DATEDIFF('dd', dt.started_at, COALESCE(dt.ended_at, CURRENT_DATE)) AS "T",
(ended_at IS NOT NULL) AS "E"
FROM dimension_table dt
Time-varying variables: cv¶
-- this could produce more than 1 row per subject
SELECT
id,
DATEDIFF('dd', dt.started_at, ft.event_at) AS "time",
ft.var1
FROM fact_table ft
JOIN dimension_table dt
USING(id)
from lifelines.utils import to_long_format
from lifelines.utils import add_covariate_to_timeline
base_df = to_long_format(base_df, duration_col="T")
df = add_covariate_to_timeline(base_df, cv, duration_col="time", id_col="id", event_col="E")
Event variables: event_df¶
Another very common operation is to add event data to our time-varying dataset. For example, a dataset/SQL table that contains information about the dates of an event (and NULLs if the event didn't occur). An example SQL query may look like:
SELECT
id,
DATEDIFF('dd', dt.started_at, ft.event1_at) AS "E1",
DATEDIFF('dd', dt.started_at, ft.event2_at) AS "E2",
DATEDIFF('dd', dt.started_at, ft.event3_at) AS "E3"
...
FROM dimension_table dt
In Pandas, this may look like:
id E1 E2 E3
0 1 1.0 NaN 2.0
1 2 NaN 5.0 NaN
2 3 3.0 5.0 7.0
...
Initially, this can't be added to our baseline time-varying dataset. Using utils.covariates_from_event_matrix we can convert a dataframe like this into one that can be easily added.
from lifelines.utils import covariates_from_event_matrix
cv = covariates_from_event_matrix(event_df, id_col='id')
print(cv)
id duration E1 E2 E3
0 1 1.0 1 0 0
1 1 2.0 0 1 0
2 2 5.0 0 1 0
3 3 3.0 1 0 0
4 3 5.0 0 1 0
5 3 7.0 0 0 1
base_df = add_covariate_to_timeline(base_df, cv, duration_col="time", id_col="id", event_col="E")
Example cumulative sums over time-varying covariates¶
Often we have either transactional covariate datasets or state covariate datasets. In a transactional dataset, it may make sense to sum up the covariates to represent administration of a treatment over time. For example, in the risky world of startups, we may want to sum up the funding amount received at a certain time. We may also be interested in the amount of the last round of funding. Below is an example to do just that:
Suppose we have an initial DataFrame of startups like:
seed_df = pd.DataFrame([
{'id': 'FB', 'E': True, 'T': 12, 'funding': 0},
{'id': 'SU', 'E': True, 'T': 10, 'funding': 0},
])
And a covariate dataframe representing funding rounds like:
cv = pd.DataFrame([
{'id': 'FB', 'funding': 30, 't': 5},
{'id': 'FB', 'funding': 15, 't': 10},
{'id': 'FB', 'funding': 50, 't': 15},
{'id': 'SU', 'funding': 10, 't': 6},
{'id': 'SU', 'funding': 9, 't': 10},
])
We can do the following to get both the cumulative funding received and the latest round of funding:
from lifelines.utils import to_long_format
from lifelines.utils import add_covariate_to_timeline
df = seed_df.pipe(to_long_format, 'T')\
            .pipe(add_covariate_to_timeline, cv, 'id', 't', 'E', cumulative_sum=True)\
            .pipe(add_covariate_to_timeline, cv, 'id', 't', 'E', cumulative_sum=False)
"""
start cumsum_funding funding stop id E
0 0 0.0 0.0 5.0 FB False
1 5 30.0 30.0 10.0 FB False
2 10 45.0 15.0 12.0 FB True
3 0 0.0 0.0 6.0 SU False
4 6 10.0 10.0 10.0 SU False
5 10 19.0 9.0 10.0 SU True
"""
Sample size determination under a CoxPH model¶
Suppose you wish to measure the hazard ratio between two populations under the CoxPH model. That is, we want to evaluate the hypothesis
H0: relative hazard ratio = 1 vs H1: relative hazard ratio != 1, where the relative hazard ratio is \(\exp\left(\beta\right)\) for the experiment group vs the control group. A priori, we are interested in the sample sizes of the two groups necessary to achieve a certain statistical power. To do this in lifelines, there is the lifelines.statistics.sample_size_necessary_under_cph function. For example:
from lifelines.statistics import sample_size_necessary_under_cph
desired_power = 0.8
ratio_of_participants = 1.
p_exp = 0.25
p_con = 0.35
postulated_hazard_ratio = 0.7
n_exp, n_con = sample_size_necessary_under_cph(desired_power, ratio_of_participants, p_exp, p_con, postulated_hazard_ratio)
# (421, 421)
This assumes you have estimates of the probability of the event occurring for both the experiment and control groups. These could be determined from previous experiments.
Power determination under a CoxPH model¶
Suppose you wish to measure the hazard ratio between two populations under the CoxPH model. To determine the statistical power of a hazard ratio hypothesis test under the CoxPH model, we can use lifelines.statistics.power_under_cph. That is, suppose we want to know the probability that we reject the null hypothesis that the relative hazard ratio is 1, assuming the relative hazard ratio is truly different from 1. This function will give you that probability.
from lifelines.statistics import power_under_cph
n_exp = 50
n_con = 100
p_exp = 0.25
p_con = 0.35
postulated_hazard_ratio = 0.5
power = power_under_cph(n_exp, n_con, p_exp, p_con, postulated_hazard_ratio)
# 0.4957
Problems with convergence in the Cox proportional hazard model¶
Since the estimation of the coefficients in the Cox proportional hazard model is done using the Newton-Raphson algorithm, there are sometimes problems with convergence. Here are some common symptoms and resolutions:

1. First check: look for ConvergenceWarning in the output. Most often, problems in convergence are the result of problems in the dataset. lifelines has checks it runs against the dataset before fitting, and warnings are outputted to the user.
2. delta contains nan value(s): First try adding show_progress=True in the fit function. If the values in delta grow unboundedly, it's possible the step_size is too large. Try setting it to a small value (0.1-0.5).
3. Convergence halted due to matrix inversion problems: This means that there is a linear combination in your dataset. That is, a column is equal to a linear combination of one or more other columns. Try to find the relationship by looking at the correlation matrix of your dataset. A common cause of this is dummifying categorical variables but not dropping a column.
4. Some coefficients are many orders of magnitude larger than others, the standard error of the coefficient is also large, or there are nans in the results. This can be seen using the print_summary method on a fitted CoxPHFitter object.
   - Look for a ConvergenceWarning about variances being too small. The dataset may contain a constant column, which provides no information for the regression (the Cox model doesn't have a traditional "intercept" term like other regression models).
   - The data is completely separable, which means that there exists a covariate that completely determines whether an event occurred or not. For example, for all "death" events in the dataset, there exists a covariate that is constant amongst all of them. Look for a ConvergenceWarning after the fit call.
   - Related to the above, the relationship between a covariate and the duration may be completely determined. For example, if the rank correlation between a covariate and the duration is very close to 1 or -1, then the log-likelihood can be increased arbitrarily using just that covariate. Look for a ConvergenceWarning after the fit call.
   - Another problem may be a collinear relationship in your dataset. See the matrix inversion point above.
5. If adding a very small penalizer significantly changes the results (CoxPHFitter(penalizer=0.0001)), then this probably means that the step size in the iterative algorithm is too large. Try decreasing it (.fit(..., step_size=0.50) or smaller), and returning the penalizer term to 0.
6. If using the strata argument, make sure your stratification group sizes are not too small. Try df.groupby(strata).size().
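For the matrix-inversion problem, a frequent cause is dummifying a categorical variable without dropping one level. A minimal sketch with pandas (the `grade` column is a hypothetical example):

```python
import pandas as pd

df = pd.DataFrame({'grade': ['a', 'b', 'c', 'a', 'b']})

# dummifying without dropping a column: the three dummies always sum to 1,
# which is a perfect linear combination with any constant column
bad = pd.get_dummies(df['grade'])
print(bad.sum(axis=1).unique())  # every row sums to 1

# dropping the first level removes the redundancy
good = pd.get_dummies(df['grade'], drop_first=True)
print(good.columns.tolist())  # the 'a' column is gone
```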
Adding weights to observations in a Cox model¶
There are two common uses for weights in a model. The first is as a data size reduction technique (known as case weights). If the dataset has more than one subject with identical attributes, including duration and event, then their likelihood contributions are the same as well. Thus, instead of computing the log-likelihood for each individual, we can compute it once and multiply it by the count of subjects with identical attributes. In practice, this involves first grouping subjects by covariates and counting. For example, using the Rossi dataset, we will use Pandas to group by the attributes (but other data processing tools, like Spark, could do this as well):
from lifelines.datasets import load_rossi
rossi = load_rossi()
rossi_weights = rossi.copy()
rossi_weights['weights'] = 1.
rossi_weights = rossi_weights.groupby(rossi.columns.tolist())['weights'].sum()\
                             .reset_index()
The original dataset has 432 rows, while the grouped dataset has 387 rows plus an additional weights column. CoxPHFitter has an additional parameter to specify which column is the weight column.
from lifelines import CoxPHFitter
cph = CoxPHFitter()
cph.fit(rossi_weights, 'week', 'arrest', weights_col='weights')
The fitting should be faster, and the results identical to the unweighted dataset. This option is also available in the CoxTimeVaryingFitter.
The second use of weights is sampling weights. These are typically positive, non-integer weights that represent some artificial under/over-sampling of observations (ex: inverse probability of treatment weights). It is recommended to set robust=True in the call to fit, as the usual standard errors are incorrect for sampling weights. The robust flag will use the sandwich estimator for the standard errors.
Warning
The implementation of the sandwich estimator does not handle ties correctly (under the Efron handling of ties), and will give slightly or significantly different results from other software depending on the frequency of ties.
Correlations between subjects in a Cox model¶
There are cases when your dataset contains correlated subjects, which breaks the independent-and-identically-distributed assumption. What are some cases when this may happen?
 A subject appears more than once in the dataset (common when subjects can have the event more than once).
 A matching technique, like propensity-score matching, is used, which creates a correlation between pairs.
In both cases, the reported standard errors from an unadjusted Cox model will be wrong. In order to adjust for these correlations, there is a cluster_col keyword in CoxPHFitter.fit that allows you to specify the column in the dataframe that contains designations for correlated subjects. For example, if subjects in rows 1 & 2 are correlated, but no other subjects are correlated, then the cluster_col column should have the same value for rows 1 & 2, and all other values unique. Another example: for matched pairs, each subject in the pair should have the same value.
from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter
rossi = load_rossi()
# this may come from a database, or other libraries that specialize in matching
matched_pairs = [
(156, 230),
(275, 228),
(61, 252),
(364, 201),
(54, 340),
(130, 33),
(183, 145),
(268, 140),
(332, 259),
(314, 413),
(330, 211),
(372, 255),
# ...
]
rossi['id'] = None # we will populate this column

for i, pair in enumerate(matched_pairs):
    subjectA, subjectB = pair
    rossi.loc[subjectA, 'id'] = i
    rossi.loc[subjectB, 'id'] = i
rossi = rossi.dropna(subset=['id'])
cph = CoxPHFitter()
cph.fit(rossi, 'week', 'arrest', cluster_col='id')
Specifying cluster_col will handle correlations, and invoke the robust sandwich estimator for standard errors (the same as setting robust=True).
Reference library for lifelines¶
lifelines.fitters¶
lifelines.fitters.aalen_additive_fitter module¶

class lifelines.fitters.aalen_additive_fitter.AalenAdditiveFitter(fit_intercept=True, alpha=0.95, coef_penalizer=0.0, smoothing_penalizer=0.0)¶
Bases: lifelines.fitters.BaseFitter
This class fits the regression model:
\[h(t|x) = b_0(t) + b_1(t) x_1 + ... + b_N(t) x_N\]
that is, the hazard rate is a linear function of the covariates with time-varying coefficients. This implementation assumes non-time-varying covariates, see
TODO: name
Note
This class was rewritten in lifelines 0.17.0 to focus solely on static datasets. There is no guarantee of backwards compatibility.
Parameters:  fit_intercept (bool, optional (default: True)) â€“ If False, do not attach an intercept (column of ones) to the covariate matrix. The intercept, \(b_0(t)\) acts as a baseline hazard.
 alpha (float) â€“ the level in the confidence intervals.
 coef_penalizer (float, optional (default: 0)) – Attach an L2 penalizer to the size of the coefficients during regression. This improves stability of the estimates and controls for high correlation between covariates. For example, this shrinks the absolute value of \(c_{i,t}\).
 smoothing_penalizer (float, optional (default: 0)) – Attach an L2 penalizer to the difference between adjacent (over time) coefficients. For example, this shrinks the absolute value of \(c_{i,t} - c_{i,t+1}\).

fit(df, duration_col, event_col=None, weights_col=None, show_progress=False)¶
Fit the Aalen Additive model to a dataset.
Parameters:  df (DataFrame) – a Pandas dataframe with necessary columns duration_col and event_col (see below), covariate columns, and special columns (weights). duration_col refers to the lifetimes of the subjects. event_col refers to whether the 'death' event was observed: 1 if observed, 0 else (censored).
 duration_col (string) â€“ the name of the column in dataframe that contains the subjectsâ€™ lifetimes.
 event_col (string, optional) – the name of the column in dataframe that contains the subjects' death observation. If left as None, assume all individuals are uncensored.
 weights_col (string, optional) â€“ an optional column in the dataframe, df, that denotes the weight per subject. This column is expelled and not used as a covariate, but as a weight in the final regression. Default weight is 1. This can be used for caseweights. For example, a weight of 2 means there were two subjects with identical observations. This can be used for sampling weights.
 show_progress (boolean, optional (default=False)) â€“ Since the fitter is iterative, show iteration number.
Returns: self – self with additional new properties: cumulative_hazards_, etc.
Return type: self
Examples
>>> from lifelines import AalenAdditiveFitter
>>>
>>> df = pd.DataFrame({
>>>     'T': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
>>>     'E': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
>>>     'var': [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2],
>>>     'age': [4, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
>>> })
>>>
>>> aaf = AalenAdditiveFitter()
>>> aaf.fit(df, 'T', 'E')
>>> aaf.predict_median(df)
>>> aaf.print_summary()

plot(columns=None, loc=None, iloc=None, **kwargs)¶
A wrapper around plotting. Matplotlib plot arguments can be passed in, plus:
Parameters:  columns (string or list-like, optional) – If not empty, plot a subset of columns from the cumulative_hazards_. Default all.
 loc, iloc (slice, optional) – specify a location-based subsection of the curves to plot, ex: .plot(iloc=slice(0,10)) will plot the first 10 time points.

predict_cumulative_hazard
(X)Â¶ Returns the hazard rates for the individuals
Parameters: X (a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns) â€“ can be in any order. If a numpy array, columns must be in the same order as the training data.

predict_expectation
(X)Â¶ Compute the expected lifetime, E[T], using covariates X.
Parameters:  X (a (n,d) covariate numpy array or DataFrame) â€“ If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
 Returns the expected lifetimes for the individuals

predict_median
(X)¶ Parameters:  X (a (n,d) covariate numpy array or DataFrame) – If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
 Returns the median lifetimes for the individuals

predict_percentile
(X, p=0.5)¶ Returns the median lifetimes for the individuals, by default. See http://stats.stackexchange.com/questions/102986/percentile-loss-functions
Parameters:  X (a (n,d) covariate numpy array or DataFrame) – If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
 p (float) – default: 0.5

predict_survival_function
(X)¶ Returns the survival functions for the individuals
Parameters: X (a (n,d) covariate numpy array or DataFrame) – If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.

print_summary
(decimals=2, **kwargs)¶ Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) – specify the number of decimal places to show
 kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

score_
¶ The concordance score (also known as the c-index) of the fit. The c-index is a generalization of the AUC to survival data, including censorship.
For this purpose, the score_ is a measure of the predictive accuracy of the fitted model on the training dataset. It's analogous to the R^2 in linear models.
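As a concrete illustration, the c-index counts, over all comparable pairs of subjects, how often the subject with the shorter observed survival also has the higher predicted risk. Below is a minimal NumPy sketch of that idea — an illustrative O(n^2) reimplementation, not lifelines' optimized concordance routine:

```python
import numpy as np

def naive_concordance(durations, predicted_risk, event_observed):
    """Naive O(n^2) c-index: a pair is comparable when the earlier time
    is an observed event; it is concordant when the earlier subject
    also has the higher predicted risk. Risk ties count as 0.5."""
    num, den = 0.0, 0.0
    n = len(durations)
    for i in range(n):
        for j in range(n):
            if durations[i] < durations[j] and event_observed[i]:
                den += 1
                if predicted_risk[i] > predicted_risk[j]:
                    num += 1
                elif predicted_risk[i] == predicted_risk[j]:
                    num += 0.5
    return num / den

# perfectly concordant risks: shorter duration, higher risk
T = np.array([1, 2, 3, 4])
risk = np.array([4, 3, 2, 1])
E = np.array([1, 1, 1, 0])
print(naive_concordance(T, risk, E))  # 1.0
```

A score of 0.5 corresponds to random predictions; 1.0 to perfect ranking.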

smoothed_hazards_
(bandwidth=1)¶ Using the Epanechnikov kernel to smooth the hazard function, with the given sigma/bandwidth.
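To illustrate the idea, the Epanechnikov kernel K(u) = 0.75(1 - u^2) on [-1, 1] can be used to spread discrete hazard increments over a time grid. This is a hedged NumPy sketch of kernel smoothing, not lifelines' implementation:

```python
import numpy as np

def epanechnikov(u):
    # Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero outside
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

def smoothed_hazard(event_times, hazard_increments, grid, bandwidth):
    """Kernel-smooth discrete hazard increments dH(t_i) onto a grid:
    h(t) ~= sum_i K((t - t_i) / b) / b * dH(t_i)."""
    u = (grid[:, None] - event_times[None, :]) / bandwidth
    return (epanechnikov(u) / bandwidth) @ hazard_increments

times = np.array([1.0, 2.0, 3.0])
dH = np.array([0.2, 0.1, 0.3])       # e.g. Nelson-Aalen increments
grid = np.linspace(0, 4, 9)
print(smoothed_hazard(times, dH, grid, bandwidth=1.0))
```

Larger bandwidths trade variance for bias, exactly as in classical kernel density estimation.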

summary
Â¶ Summary statistics describing the fit. Set alpha property in the object before calling.
Returns: df Return type: DataFrame
lifelines.fitters.aalen_johansen_fitter module¶

class
lifelines.fitters.aalen_johansen_fitter.
AalenJohansenFitter
(jitter_level=0.00001, seed=None, alpha=0.95, calculate_variance=True)¶ Bases:
lifelines.fitters.UnivariateFitter
Class for fitting the Aalen-Johansen estimate for the cumulative incidence function in a competing risks framework. Treating competing risks as censoring can result in overestimated cumulative density functions. Using the Kaplan-Meier estimator with competing risks treated as censored is akin to estimating the cumulative density if all competing risks had been prevented.
Aalen-Johansen cannot deal with tied times. We can get around this by randomly jittering the event times slightly. This is done automatically and generates a warning.
Parameters:  alpha (float, optional (default=0.95)) – The alpha value associated with the confidence intervals.
 jitter_level (float, optional (default=0.00001)) – If tied event times are detected, event times are randomly changed by this factor.
 seed (int, optional (default=None)) – To produce replicable results with tied event times, numpy.random.seed can be specified in the function.
 calculate_variance (bool, optional (default=True)) – By default, AalenJohansenFitter calculates the variance and corresponding confidence intervals. Due to how the variance is calculated, the variance must be calculated for each event time individually. This is computationally intensive. For some procedures, like bootstrapping, the variance is not necessary. To reduce computation time during these procedures, calculate_variance can be set to False to skip the variance calculation.
Example
>>> AalenJohansenFitter(alpha=0.95, jitter_level=0.00001, seed=None, calculate_variance=True)
References
If you are interested in learning more, we recommend the following open-access paper: Edwards JK, Hester LL, Gokhale M, Lesko CR. Methodologic Issues When Estimating Risks in Pharmacoepidemiology. Curr Epidemiol Rep. 2016;3(4):285-296.
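The estimator itself can be illustrated with a minimal NumPy sketch: the cumulative incidence for event k accumulates S(t-) * d_k/n at each event time, where S is the all-cause Kaplan-Meier curve. This is an illustrative reimplementation only — no tie handling, weights, or variance, unlike lifelines:

```python
import numpy as np

def aalen_johansen_cif(durations, events, event_of_interest):
    """Minimal Aalen-Johansen cumulative incidence sketch.
    events: 0 = censored, positive integers = distinct event types.
    Assumes no tied event times."""
    order = np.argsort(durations)
    durations = np.asarray(durations)[order]
    events = np.asarray(events)[order]
    at_risk = len(durations)
    surv = 1.0          # all-cause Kaplan-Meier survival, S(t-)
    cif = 0.0
    out = []
    for t, e in zip(durations, events):
        if e == event_of_interest:
            cif += surv * (1.0 / at_risk)   # S(t-) * d_k / n
        if e != 0:
            surv *= 1.0 - 1.0 / at_risk     # any event reduces survival
        at_risk -= 1
        out.append((t, cif))
    return out

# 0: censored, 1: event of interest, 2: competing event
print(aalen_johansen_cif([1, 2, 3, 4], [1, 2, 1, 0], event_of_interest=1))
```

Note that the incidence functions of all event types sum to 1 - S(t), which is why treating competing events as censored overestimates each individual risk.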

conditional_time_to_event_
¶ Return a DataFrame, with index equal to survival_function_, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual exists until age 1, their expected life remaining given they lived to time 1 might be 9 years.
Returns: conditional_time_to_ – with index equal to survival_function_
Return type: DataFrame

cumulative_hazard_at_times
(times, label=None)Â¶

divide
(other)Â¶ Divide the {0} of two {1} objects.
Parameters: other (an {1} fitted instance.)

fit
(durations, event_observed, event_of_interest, timeline=None, entry=None, label='AJ_estimate', alpha=None, ci_labels=None, weights=None)¶ Parameters:  durations (an array or pd.Series of length n) – duration the subject was observed for
 event_observed (an array, or pd.Series, of length n) – integer indicator of distinct events. Must be non-negative integers, where 0 indicates censoring.
 event_of_interest (integer) – indicator for the event of interest. All other integers are considered competing events. Ex) event_observed contains 0, 1, 2 where 0: censored, 1: lung cancer, and 2: death. If event_of_interest=1, then death (2) is considered a competing event. The returned cumulative incidence function corresponds to the risk of lung cancer.
 timeline – return the best estimate at the values in timeline (positively increasing)
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population were born at time 0.
 label (string) – a string to name the column of the estimate.
 alpha (float) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (iterable) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 weights (an array, or pd.Series, of length n) – if providing a weighted dataset. For example, instead of providing every subject as a single element of durations and event_observed, one could weigh subjects differently.
Returns: self – self, with new properties like cumulative_incidence_.
Return type: self

hazard_at_times
(times, label=None)Â¶

plot
(**kwargs)Â¶

plot_cumulative_hazard
(**kwargs)Â¶

plot_hazard
(**kwargs)Â¶

plot_survival_function
(**kwargs)Â¶

predict
(times)¶ Predict the {0} at a certain point in time. Uses a linear interpolation if points in time are not in the index.
Parameters: times (a scalar or an array of times to predict the value of {0} at.)
Returns: predictions
Return type: a scalar if times is a scalar, a numpy array if times is an array.

subtract
(other)Â¶ Subtract the {0} of two {1} objects.
Parameters: other (an {1} fitted instance.)

survival_function_at_times
(times, label=None)Â¶
lifelines.fitters.breslow_fleming_harrington_fitter module¶

class
lifelines.fitters.breslow_fleming_harrington_fitter.
BreslowFlemingHarringtonFitter
(alpha=0.95)Â¶ Bases:
lifelines.fitters.UnivariateFitter
Class for fitting the Breslow-Fleming-Harrington estimate for the survival function. This estimator is a biased estimator of the survival function, but is more stable when the population is small and there are too few early truncation times: it may happen that the number of patients at risk and the number of deaths are the same.
Mathematically, the Nelson-Aalen estimator is the negative logarithm of the BFH estimator.
BreslowFlemingHarringtonFitter(alpha=0.95)
Parameters: alpha (float) – The alpha value associated with the confidence intervals.
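The log-relationship to the Nelson-Aalen estimator gives a simple way to sketch this estimator: compute the Nelson-Aalen cumulative hazard and exponentiate its negative. The following is an illustrative NumPy reimplementation, not lifelines' code, and assumes no tied event times:

```python
import numpy as np

def bfh_survival(durations, event_observed):
    """Sketch of the BFH estimate: S_BFH(t) = exp(-H_NA(t)), where
    H_NA is the Nelson-Aalen cumulative hazard. Assumes no ties."""
    order = np.argsort(durations)
    durations = np.asarray(durations)[order]
    event_observed = np.asarray(event_observed)[order]
    n = len(durations)
    H = 0.0
    surv = []
    for i, (t, e) in enumerate(zip(durations, event_observed)):
        at_risk = n - i
        if e:
            H += 1.0 / at_risk            # Nelson-Aalen increment d_i / n_i
        surv.append((t, np.exp(-H)))      # exponentiate the negative hazard
    return surv

print(bfh_survival([1, 2, 3, 4], [1, 0, 1, 1]))
```

Unlike the Kaplan-Meier product-limit estimate, this never reaches exactly zero, which is what makes it more stable when the risk set is tiny.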
conditional_time_to_event_
¶ Return a DataFrame, with index equal to survival_function_, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual exists until age 1, their expected life remaining given they lived to time 1 might be 9 years.
Returns: conditional_time_to_ – with index equal to survival_function_
Return type: DataFrame

cumulative_hazard_at_times
(times, label=None)Â¶

divide
(other)Â¶ Divide the {0} of two {1} objects.
Parameters: other (an {1} fitted instance.)

fit
(durations, event_observed=None, timeline=None, entry=None, label='BFH_estimate', alpha=None, ci_labels=None)¶ Parameters:  durations (an array, or pd.Series, of length n) – duration the subject was observed for
 timeline – return the best estimate at the values in timeline (positively increasing)
 event_observed (an array, or pd.Series, of length n) – True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed==None
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated observations, i.e. the birth event was not observed. If None, defaults to all 0 (all birth events observed.)
 label (string) – a string to name the column of the estimate.
 alpha (float) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (iterable) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
Returns: self – self, with new properties like survival_function_.
Return type: self

hazard_at_times
(times, label=None)Â¶

plot
(**kwargs)Â¶

plot_cumulative_hazard
(**kwargs)Â¶

plot_hazard
(**kwargs)Â¶

plot_survival_function
(**kwargs)Â¶

predict
(times)¶ Predict the {0} at a certain point in time. Uses a linear interpolation if points in time are not in the index.
Parameters: times (a scalar or an array of times to predict the value of {0} at.)
Returns: predictions
Return type: a scalar if times is a scalar, a numpy array if times is an array.

subtract
(other)Â¶ Subtract the {0} of two {1} objects.
Parameters: other (an {1} fitted instance.)

survival_function_at_times
(times, label=None)¶ Return a Pandas series of the predicted survival value at specific times
Parameters: times (iterable or float)
Return type: pd.Series

lifelines.fitters.cox_time_varying_fitter module¶

class
lifelines.fitters.cox_time_varying_fitter.
CoxTimeVaryingFitter
(alpha=0.95, penalizer=0.0, strata=None)Â¶ Bases:
lifelines.fitters.BaseFitter
This class implements fitting Cox's time-varying proportional hazard model:
\[h(t|x(t)) = h_0(t)\exp(x(t)'\beta)\]
Parameters:  alpha (float, optional) – the level in the confidence intervals.
 penalizer (float, optional) – the coefficient of an L2 penalizer in the regression

fit
(df, id_col, event_col, start_col='start', stop_col='stop', weights_col=None, show_progress=False, step_size=None, robust=False, strata=None)¶ Fit the Cox proportional hazard model to a time-varying dataset. Tied survival times are handled using Efron's tie-method.
Parameters:  df (DataFrame) – a Pandas dataframe with necessary columns id_col, event_col, start_col, and stop_col, plus other covariates. event_col refers to whether the 'death' event was observed: 1 if observed, 0 else (censored).
 id_col (string) – A subject could have multiple rows in the dataframe. This column contains the unique identifier per subject.
 event_col (string) – the column in the dataframe that contains the subjects' death observation. If left as None, assume all individuals are non-censored.
 start_col (string) – the column that contains the start of a subject's time period.
 stop_col (string) – the column that contains the end of a subject's time period.
 weights_col (string, optional) – the column that contains the (possibly time-varying) weight of each subject-period row.
 show_progress (boolean, optional (default=False)) – since the fitter is iterative, show convergence diagnostics.
 robust (boolean, optional (default=False)) – Compute the robust errors using the Huber sandwich estimator, aka the Wei-Lin estimate. This does not handle ties, so if there is a high number of ties, results may significantly differ. See "The Robust Inference for the Cox Proportional Hazards Model", Journal of the American Statistical Association, Vol. 84, No. 408 (Dec., 1989), pp. 1074-1078
 step_size (float, optional) – set an initial step size for the fitting algorithm.
 strata (TODO)
Returns: self – self, with additional properties like hazards_ and print_summary
Return type: self

plot
(columns=None, **errorbar_kwargs)¶ Produces a visual representation of the coefficients, including their standard errors and magnitudes.
Parameters:  columns (list, optional) – specify a subset of the columns to plot
 errorbar_kwargs – pass in additional plotting commands to matplotlib's errorbar command
Returns: ax – the matplotlib axis that can be edited.
Return type: matplotlib axis

predict_log_partial_hazard
(X)¶ This is equivalent to R's linear.predictors. Returns the log of the partial hazard for the individuals, partial since the baseline hazard is not included. Equal to \(\beta (X - \bar{X})\)
Parameters: X (numpy array or DataFrame) – a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
Return type: DataFrame
Note
If X is a dataframe, the order of the columns does not matter. But if X is an array, then the column ordering is assumed to be the same as the training dataset.
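The formula above can be sketched directly in NumPy. The coefficients and training data below are made-up stand-ins for illustration, not fitted values from any model:

```python
import numpy as np

# Sketch of the log partial hazard: beta @ (x - mean of training X).
# beta and X_train are hypothetical, not output of a fit.
beta = np.array([0.5, -0.25])
X_train = np.array([[1.0, 2.0],
                    [3.0, 0.0],
                    [5.0, 4.0]])
X_mean = X_train.mean(axis=0)            # [3.0, 2.0]

def log_partial_hazard(X):
    # centered linear predictor, one value per row of X
    return (X - X_mean) @ beta

X_new = np.array([[4.0, 2.0]])
print(log_partial_hazard(X_new))         # [0.5]
# the partial hazard is its exponential:
print(np.exp(log_partial_hazard(X_new)))
```

Centering at the training mean only shifts the baseline hazard; hazard ratios between subjects are unaffected.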

predict_partial_hazard
(X)¶ Returns the partial hazard for the individuals, partial since the baseline hazard is not included. Equal to \(\exp(\beta (X - \bar{X}))\)
Parameters: X (numpy array or DataFrame) – a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
Return type: DataFrame
Note
If X is a dataframe, the order of the columns does not matter. But if X is an array, then the column ordering is assumed to be the same as the training dataset.

print_summary
(decimals=2, **kwargs)¶ Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) – specify the number of decimal places to show
 kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

summary
¶ Summary statistics describing the fit. Set the alpha property in the object before calling.
Returns: df – contains columns coef, exp(coef), se(coef), z, p, lower, upper
Return type: DataFrame
lifelines.fitters.coxph_fitter module¶

class
lifelines.fitters.coxph_fitter.
CoxPHFitter
(alpha=0.95, tie_method='Efron', penalizer=0.0, strata=None)Â¶ Bases:
lifelines.fitters.BaseFitter
This class implements fitting Cox's proportional hazard model:
\[h(t|x) = h_0(t) \exp(x \beta)\]
Parameters: alpha (float, optional (default=0.95)) – the level in the confidence intervals.
tie_method (string, optional) – specify how the fitter should deal with ties. Currently only 'Efron' is available.
penalizer (float, optional (default=0.0)) – Attach an L2 penalizer to the size of the coefficients during regression. This improves stability of the estimates and controls for high correlation between covariates. For example, this shrinks the absolute value of \(\beta_i\). Recommended, even if a small value. The penalty is \(\frac{1}{2} \text{penalizer} \beta^2\).
strata (list, optional) – specify a list of columns to use in stratification. This is useful if a categorical covariate does not obey the proportional hazard assumption. This is used similarly to the strata expression in R. See http://courses.washington.edu/b515/l17.pdf.
Examples
>>> from lifelines.datasets import load_rossi
>>> from lifelines import CoxPHFitter
>>> rossi = load_rossi()
>>> cph = CoxPHFitter()
>>> cph.fit(rossi, 'week', 'arrest')
>>> cph.print_summary()

check_assumptions
(training_df, advice=True, show_plots=False, p_value_threshold=0.05, plot_n_bootstraps=10)¶ Use this function to test the proportional hazards assumption. See an iterative usage example at https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html
Parameters:  training_df (DataFrame) – the original DataFrame used in the call to fit(...)
 advice (boolean, optional) – display advice as output to the user's screen
 show_plots (boolean, optional) – display plots of the scaled Schoenfeld residuals and loess curves. This is an eyeball test for violations. This will slow down the function significantly.
 p_value_threshold (float, optional) – the threshold to use to alert the user of violations. See note below.
 plot_n_bootstraps – in the plots displayed, also display plot_n_bootstraps bootstrapped loess curves. This will slow down the function significantly.
Examples
>>> from lifelines.datasets import load_rossi
>>> from lifelines import CoxPHFitter
>>>
>>> rossi = load_rossi()
>>> cph = CoxPHFitter().fit(rossi, 'week', 'arrest')
>>>
>>> cph.check_assumptions(rossi)
Notes
The p_value_threshold is arbitrarily set at 0.05. Under the null, some covariates will be below the threshold by chance. This is compounded when there are many covariates. Similarly, when there are lots of observations, even minor deviations from the proportional hazard assumption will be flagged.
With that in mind, it's best to use a combination of statistical tests and eyeball tests to determine the most serious violations.
References
section 5 in https://socialsciences.mcmaster.ca/jfox/Books/Companion/appendices/AppendixCoxRegression.pdf http://www.mwsug.org/proceedings/2006/stats/MWSUG2006SD08.pdf http://eprints.lse.ac.uk/84988/1/06_ParkHendry2015ReassessingSchoenfeldTests_Final.pdf

compute_residuals
(training_dataframe, kind)¶ Parameters:  training_dataframe (pandas DataFrame) – the same training dataframe given in fit
 kind (string) – one of {'schoenfeld', 'score', 'delta_beta', 'deviance', 'martingale', 'scaled_schoenfeld'}

fit
(df, duration_col=None, event_col=None, show_progress=False, initial_beta=None, strata=None, step_size=None, weights_col=None, cluster_col=None, robust=False, batch_mode=None)¶ Fit the Cox proportional hazard model to a dataset.
Parameters:  df (DataFrame) – a Pandas dataframe with necessary columns duration_col and event_col (see below), covariate columns, and special columns (weights, strata). duration_col refers to the lifetimes of the subjects. event_col refers to whether the 'death' event was observed: 1 if observed, 0 else (censored).
 duration_col (string) – the name of the column in the dataframe that contains the subjects' lifetimes.
 event_col (string, optional) – the name of the column in the dataframe that contains the subjects' death observation. If left as None, assume all individuals are uncensored.
 weights_col (string, optional) – an optional column in the dataframe, df, that denotes the weight per subject. This column is excluded from the covariates and used as a weight in the final regression. Default weight is 1. This can be used for case weights: for example, a weight of 2 means there were two subjects with identical observations. This can also be used for sampling weights. In that case, use robust=True to get more accurate standard errors.
 show_progress (boolean, optional (default=False)) – since the fitter is iterative, show convergence diagnostics. Useful if convergence is failing.
 initial_beta (numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
 strata (list or string, optional) – specify a column or list of columns to use in stratification. This is useful if a categorical covariate does not obey the proportional hazard assumption. This is used similarly to the strata expression in R. See http://courses.washington.edu/b515/l17.pdf.
 step_size (float, optional) – set an initial step size for the fitting algorithm.
 robust (boolean, optional (default=False)) – Compute the robust errors using the Huber sandwich estimator, aka the Wei-Lin estimate. This does not handle ties, so if there is a high number of ties, results may significantly differ. See "The Robust Inference for the Cox Proportional Hazards Model", Journal of the American Statistical Association, Vol. 84, No. 408 (Dec., 1989), pp. 1074-1078
 cluster_col (string, optional) – specifies what column has unique identifiers for clustering covariances. Using this forces the sandwich estimator (robust variance estimator) to be used.
 batch_mode (bool, optional) – enabling batch_mode can be faster for datasets with a large number of ties. If left as None, lifelines will choose the best option.
Returns: self – self with additional new properties: print_summary, hazards_, confidence_intervals_, baseline_survival_, etc.
Return type: self
Note
Tied survival times are handled using Efron's tie-method.
Examples
>>> from lifelines import CoxPHFitter
>>>
>>> df = pd.DataFrame({
>>>     'T': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
>>>     'E': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
>>>     'var': [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2],
>>>     'age': [4, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
>>> })
>>>
>>> cph = CoxPHFitter()
>>> cph.fit(df, 'T', 'E')
>>> cph.print_summary()
>>> cph.predict_median(df)

>>> from lifelines import CoxPHFitter
>>>
>>> df = pd.DataFrame({
>>>     'T': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
>>>     'E': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
>>>     'var': [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2],
>>>     'weights': [1.1, 0.5, 2.0, 1.6, 1.2, 4.3, 1.4, 4.5, 3.0, 3.2, 0.4, 6.2],
>>>     'month': [10, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
>>>     'age': [4, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
>>> })
>>>
>>> cph = CoxPHFitter()
>>> cph.fit(df, 'T', 'E', strata=['month', 'age'], robust=True, weights_col='weights')
>>> cph.print_summary()
>>> cph.predict_median(df)

plot
(columns=None, **errorbar_kwargs)¶ Produces a visual representation of the coefficients, including their standard errors and magnitudes.
Parameters:  columns (list, optional) – specify a subset of the columns to plot
 errorbar_kwargs – pass in additional plotting commands to matplotlib's errorbar command
Returns: ax – the matplotlib axis that can be edited.
Return type: matplotlib axis

plot_covariate_groups
(covariate, values, **kwargs)¶ Produces a visual representation comparing the baseline survival curve of the model versus what happens when a covariate is varied over values in a group. This is useful to compare subjects' survival as we vary a single covariate, all else being held equal. The baseline survival curve is equal to the predicted survival curve at all average values in the original dataset.
Parameters:  covariate (string) – a string of the covariate in the original dataset that we wish to vary.
 values (iterable) – an iterable of the values we wish the covariate to take on.
 kwargs – pass in additional plotting commands
Returns: ax – the matplotlib axis that can be edited.
Return type: matplotlib axis, or list of axes

predict_cumulative_hazard
(X, times=None)¶ Parameters:  X (numpy array or DataFrame) – a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
 times (iterable, optional) – an iterable of increasing times to predict the cumulative hazard at. Default is the set of all durations (observed and unobserved). Uses a linear interpolation if points in time are not in the index.
Returns: cumulative_hazard_ – the cumulative hazard of individuals over the timeline
Return type: DataFrame

predict_expectation
(X)¶ Compute the expected lifetime, \(E[T]\), using covariates X. The algorithm uses the fact that \(E[T] = \int_0^\infty P(T > t) dt = \int_0^\infty S(t) dt\). To compute the integral, we use the trapezoidal rule to approximate it.
However, if the survival function doesn't converge to 0, then the expectation is really infinity and the returned values are meaningless/too large. In that case, using predict_median or predict_percentile would be better.
Parameters: X (numpy array or DataFrame) – a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
Returns: expectations
Return type: DataFrame
Notes
If X is a dataframe, the order of the columns does not matter. But if X is an array, then the column ordering is assumed to be the same as the training dataset.
See also
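The integral identity above is easy to sketch in NumPy. Here the survival curve is a synthetic exponential with rate 0.5 (true expectation 2), not output from a fitted model:

```python
import numpy as np

# E[T] = integral of S(t) dt, approximated with the trapezoidal rule.
# S here is a made-up exponential survival curve, not a model's output.
t = np.linspace(0, 50, 5001)
S = np.exp(-0.5 * t)

# trapezoidal rule: sum of (average of adjacent S values) * step width
expectation = np.sum((S[1:] + S[:-1]) / 2 * np.diff(t))
print(expectation)  # close to 2.0
```

If S(t) were flat and well above zero at the right end of the grid, the truncated integral would badly underestimate (or, conceptually, the true expectation would be infinite), which is exactly the caveat above.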

predict_log_partial_hazard
(X)¶ This is equivalent to R's linear.predictors. Returns the log of the partial hazard for the individuals, partial since the baseline hazard is not included. Equal to \(\beta (X - mean(X_{train}))\)
Parameters: X (numpy array or DataFrame) – a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
Returns: log_partial_hazard
Return type: DataFrame
Notes
If X is a dataframe, the order of the columns does not matter. But if X is an array, then the column ordering is assumed to be the same as the training dataset.

predict_median
(X)¶ Predict the median lifetimes for the individuals. If the survival curve of an individual does not cross 0.5, then the result is infinity.
Parameters: X (numpy array or DataFrame) – a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
Returns: percentiles – the median lifetimes for the individuals. If the survival curve of an individual does not cross 0.5, then the result is infinity.
Return type: DataFrame
See also

predict_partial_hazard
(X)¶ Parameters: X (numpy array or DataFrame) – a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
Returns: partial_hazard – Returns the partial hazard for the individuals, partial since the baseline hazard is not included. Equal to \(\exp(\beta (X - mean(X_{train})))\)
Return type: DataFrame
Notes
If X is a dataframe, the order of the columns does not matter. But if X is an array, then the column ordering is assumed to be the same as the training dataset.

predict_percentile
(X, p=0.5)¶ Returns the median lifetimes for the individuals, by default. If the survival curve of an individual does not cross 0.5, then the result is infinity. See http://stats.stackexchange.com/questions/102986/percentile-loss-functions
Parameters:  X (numpy array or DataFrame) – a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
 p (float, optional (default=0.5)) – the percentile, must be between 0 and 1.
Returns: percentiles
Return type: DataFrame
See also

predict_survival_function
(X, times=None)¶ Predict the survival function for individuals, given their covariates. This assumes that the individual just entered the study (that is, we do not condition on how long they have already lived for.)
Parameters:  X (numpy array or DataFrame) – a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
 times (iterable, optional) – an iterable of increasing times to predict the survival function at. Default is the set of all durations (observed and unobserved). Uses a linear interpolation if points in time are not in the index.
Returns: survival_function – the survival probabilities of individuals over the timeline
Return type: DataFrame

print_summary
(decimals=2, **kwargs)¶ Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) – specify the number of decimal places to show
 alpha (float or iterable) – specify confidence intervals to show
 kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

score_
¶ The concordance score (also known as the c-index) of the fit. The c-index is a generalization of the AUC to survival data, including censorship.
For this purpose, the score_ is a measure of the predictive accuracy of the fitted model on the training dataset. It's analogous to the R^2 in linear models.

summary
¶ Summary statistics describing the fit. Set the alpha property in the object before calling.
Returns: df – Contains columns coef, np.exp(coef), se(coef), z, p, lower, upper
Return type: DataFrame
lifelines.fitters.exponential_fitter module¶

class
lifelines.fitters.exponential_fitter.
ExponentialFitter
(*args, **kwargs)Â¶ Bases:
lifelines.fitters.KnownModelParametericUnivariateFitter
This class implements an Exponential model for univariate data. The model has parameterized form:
\[S(t) = \exp(-\lambda t), \lambda > 0\]which implies the cumulative hazard rate is
\[H(t) = \lambda t\]and the hazard rate is:
\[h(t) = \lambda\]After calling the .fit method, you have access to properties like:
survival_function_, lambda_, cumulative_hazard_. A summary of the fit is available with the method print_summary().
Notes
Reference: https://www4.stat.ncsu.edu/~dzhang2/st745/chap3.pdf
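With right censoring, the maximum likelihood estimate for this model has a simple closed form: lambda_hat = (number of observed events) / (total observed time). A minimal NumPy sketch on made-up data, shown for intuition rather than as lifelines' fitting procedure:

```python
import numpy as np

# Closed-form MLE for the exponential model under right censoring:
# lambda_hat = (# observed events) / (total observed time).
# T and E below are made-up data for illustration.
T = np.array([2.0, 4.0, 6.0, 8.0])   # durations observed
E = np.array([1, 1, 0, 1])           # third subject is censored

lambda_hat = E.sum() / T.sum()       # 3 / 20 = 0.15
print(lambda_hat)

# implied curves at t = 10:
t = 10.0
print(np.exp(-lambda_hat * t))       # S(t) = exp(-lambda * t)
print(lambda_hat * t)                # H(t) = lambda * t
```

Censored subjects contribute exposure time to the denominator but no event to the numerator, which is why naively dropping them would overestimate the rate.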

conditional_time_to_event_
¶ Return a DataFrame, with index equal to survival_function_, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual exists until age 1, their expected life remaining given they lived to time 1 might be 9 years.
Returns: conditional_time_to_ – with index equal to survival_function_
Return type: DataFrame

confidence_interval_
¶ The confidence interval of the cumulative hazard. This is an alias for confidence_interval_cumulative_hazard_.

confidence_interval_cumulative_hazard_
¶ The confidence interval of the cumulative hazard. This is an alias for confidence_interval_.

confidence_interval_hazard_
¶ The confidence interval of the hazard.

confidence_interval_survival_function_
¶ The confidence interval of the survival function.

cumulative_hazard_at_times
(times, label=None)¶ Return a Pandas series of the predicted cumulative hazard value at specific times.
Parameters:  times (iterable or float) – values to return the cumulative hazard at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Return type: pd.Series

divide
(other)Â¶ Divide the {0} of two {1} objects.
Parameters: other (an {1} fitted instance.)

fit
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None)¶ Parameters:  durations (an array, or pd.Series) – length n, duration the subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed==None
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were "born": time zero.
Returns: self – self with new properties like cumulative_hazard_, survival_function_
Return type: self

hazard_at_times
(times, label=None)¶ Return a Pandas series of the predicted hazard at specific times.
Parameters:  times (iterable or float) – values to return the hazard at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Return type: pd.Series

median_
Â¶ Return the unique time point, t, such that S(t) = 0.5. This is the â€śhalflifeâ€ť of the population, and a robust summary statistic for the population, if it exists.

plot
(**kwargs)Â¶

plot_cumulative_hazard
(**kwargs)Â¶

plot_hazard
(**kwargs)Â¶

plot_survival_function
(**kwargs)Â¶

predict
(times)
Predict the fitted estimate at certain points in time. Uses linear interpolation if the given times are not in the index.
Parameters: times (a scalar or an array of times to predict the value at.)
Returns: predictions – a scalar if times is a scalar, a numpy array if times is an array.
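The linear-interpolation behaviour of predict can be sketched in plain Python. This is an illustrative re-implementation of the idea, not lifelines' internal code; the function name interpolate_at is hypothetical:

```python
from bisect import bisect_right

def interpolate_at(index, values, t):
    """Linearly interpolate a sampled curve (e.g. a fitted survival
    function) at time t, mirroring how a predict-style method can
    handle times that are not in the fitted index."""
    if t <= index[0]:
        return values[0]
    if t >= index[-1]:
        return values[-1]
    i = bisect_right(index, t)
    t0, t1 = index[i - 1], index[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

# Survival values sampled at times 0, 10, 20; query halfway between samples:
print(interpolate_at([0, 10, 20], [1.0, 0.6, 0.2], 15))  # 0.4
```

Times outside the fitted index are clamped to the first or last value, which matches the intuition that an estimate is only defined on the observed timeline.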

print_summary
(decimals=2, **kwargs)
Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) – specify the number of decimal places to show
 kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

subtract
(other)
Subtract the fitted estimate of another fitted object of the same class from this one's.
Parameters: other (a fitted instance of the same class)

summary
Summary statistics describing the fit.
Returns: df (pd.DataFrame) – contains columns coef, exp(coef), se(coef), z, p, lower, upper
See also: print_summary

survival_function_at_times
(times, label=None)
Return a Pandas Series of the predicted survival value at specific times.
Parameters:  times (iterable or float) – values to return the survival function at.
 label (string, optional) – rename the returned Series. Useful for plotting.
Returns: pd.Series

lifelines.fitters.kaplan_meier_fitter module

class
lifelines.fitters.kaplan_meier_fitter.
KaplanMeierFitter
(alpha=0.95)
Bases: lifelines.fitters.UnivariateFitter
Class for fitting the Kaplan-Meier estimate of the survival function.
Parameters: alpha (float, optional (default=0.95)) – The alpha value associated with the confidence intervals.
Examples
>>> from lifelines import KaplanMeierFitter
>>> from lifelines.datasets import load_waltons
>>> waltons = load_waltons()
>>> kmf = KaplanMeierFitter()
>>> kmf.fit(waltons['T'], waltons['E'])
>>> kmf.plot()
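To make the estimator concrete, here is a minimal pure-Python sketch of the product-limit calculation, illustrative only and not lifelines' implementation (the fitter above additionally handles weights, left truncation and confidence intervals):

```python
from collections import Counter

def kaplan_meier(durations, event_observed):
    """Product-limit estimate of S(t): at each distinct time t_i,
    multiply S by (1 - d_i / n_i), where d_i is the number of deaths
    at t_i and n_i the number at risk just before t_i."""
    deaths = Counter(t for t, e in zip(durations, event_observed) if e)
    removals = Counter(durations)  # deaths and censorings both leave the risk set
    at_risk = len(durations)
    surv, s = {}, 1.0
    for t in sorted(removals):
        d = deaths.get(t, 0)
        if d:
            s *= 1.0 - d / at_risk
        surv[t] = s
        at_risk -= removals[t]
    return surv

# Three deaths and one censoring (at t=3):
sf = kaplan_meier([1, 2, 3, 4], [1, 1, 0, 1])
print(sf)  # {1: 0.75, 2: 0.5, 3: 0.5, 4: 0.0}
```

Note how the censored subject at t=3 leaves S unchanged but still shrinks the risk set for later times.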

conditional_time_to_event_
Return a DataFrame, with index equal to survival_function_, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual survives until age 1, their expected remaining lifetime given that they lived to time 1 might be 9 years.
Returns: conditional_time_to_ (DataFrame) – with index equal to survival_function_

cumulative_hazard_at_times
(times, label=None)

divide
(other)
Divide the fitted estimate of this object by that of another fitted object of the same class.
Parameters: other (a fitted instance of the same class)

fit
(durations, event_observed=None, timeline=None, entry=None, label='KM_estimate', alpha=None, left_censorship=False, ci_labels=None, weights=None)
Parameters:  durations (an array, or pd.Series, of length n) – duration each subject was observed for
 timeline (iterable) – return the best estimate at the values in timeline (positively increasing)
 event_observed (an array, or pd.Series, of length n) – True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were "born".
 label (string) – a string to name the column of the estimate.
 alpha (float) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 left_censorship (boolean) – True if durations and event_observed refer to left-censorship events. Default False.
 ci_labels (list) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 weights (an array, or pd.Series, of length n) – if providing a weighted dataset. For example, instead of providing every subject as a single element of durations and event_observed, one could weight subjects differently.
Returns: self – with new properties like survival_function_.

hazard_at_times
(times, label=None)

plot
(**kwargs)

plot_cumulative_hazard
(**kwargs)

plot_hazard
(**kwargs)

plot_loglogs
(*args, **kwargs)

plot_survival_function
(**kwargs)

predict
(times)
Predict the fitted estimate at certain points in time. Uses linear interpolation if the given times are not in the index.
Parameters: times (a scalar or an array of times to predict the value at.)
Returns: predictions – a scalar if times is a scalar, a numpy array if times is an array.

subtract
(other)
Subtract the fitted estimate of another fitted object of the same class from this one's.
Parameters: other (a fitted instance of the same class)

survival_function_at_times
(times, label=None)
Return a Pandas Series of the predicted survival value at specific times.
Parameters: times (iterable or float)
Returns: pd.Series

lifelines.fitters.log_logistic_fitter module

class
lifelines.fitters.log_logistic_fitter.
LogLogisticFitter
(*args, **kwargs)
Bases: lifelines.fitters.KnownModelParametericUnivariateFitter
This class implements a Log-Logistic model for univariate data. The model has parameterized form:
\[S(t) = \left(1 + \left(\frac{t}{\alpha}\right)^{\beta}\right)^{-1}, \alpha > 0, \beta > 0,\]
the hazard rate is:
\[h(t) = \frac{\left(\frac{\beta}{\alpha}\right)\left(\frac{t}{\alpha}\right)^{\beta - 1}}{\left(1 + \left(\frac{t}{\alpha}\right)^{\beta}\right)},\]
and the cumulative hazard is:
\[H(t) = \log\left(\left(\frac{t}{\alpha}\right)^{\beta} + 1\right).\]
After calling the .fit method, you have access to properties like cumulative_hazard_, plot, survival_function_, alpha_ and beta_. A summary of the fit is available with the method print_summary().
Examples
>>> from lifelines import LogLogisticFitter
>>> from lifelines.datasets import load_waltons
>>> waltons = load_waltons()
>>> llf = LogLogisticFitter()
>>> llf.fit(waltons['T'], waltons['E'])
>>> llf.plot()
>>> print(llf.alpha_)
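The three formulas above are mutually consistent through H(t) = -log S(t), which is easy to check numerically. A minimal sketch, with hypothetical parameter values chosen only for illustration:

```python
import math

# Hypothetical parameter values, for illustration only.
ALPHA, BETA = 2.0, 3.0

def survival(t, alpha=ALPHA, beta=BETA):
    # S(t) = (1 + (t/alpha)^beta)^(-1)
    return 1.0 / (1.0 + (t / alpha) ** beta)

def cumulative_hazard(t, alpha=ALPHA, beta=BETA):
    # H(t) = log((t/alpha)^beta + 1)
    return math.log((t / alpha) ** beta + 1.0)

# The two forms agree via H(t) = -log S(t):
t = 1.7
print(abs(cumulative_hazard(t) + math.log(survival(t))) < 1e-12)  # True
```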

conditional_time_to_event_
Return a DataFrame, with index equal to survival_function_, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual survives until age 1, their expected remaining lifetime given that they lived to time 1 might be 9 years.
Returns: conditional_time_to_ (DataFrame) – with index equal to survival_function_

confidence_interval_
The confidence interval of the cumulative hazard. This is an alias for confidence_interval_cumulative_hazard_.

confidence_interval_cumulative_hazard_
The confidence interval of the cumulative hazard. This is an alias for confidence_interval_.

confidence_interval_hazard_
The confidence interval of the hazard.

confidence_interval_survival_function_
The confidence interval of the survival function.

cumulative_hazard_at_times
(times, label=None)
Return a Pandas Series of the predicted cumulative hazard value at specific times.
Parameters:  times (iterable or float) – values to return the cumulative hazard at.
 label (string, optional) – rename the returned Series. Useful for plotting.
Returns: pd.Series

divide
(other)
Divide the fitted estimate of this object by that of another fitted object of the same class.
Parameters: other (a fitted instance of the same class)

fit
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None)
Parameters:  durations (an array, or pd.Series) – length n, duration each subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None.
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were "born": time zero.
Returns: self – with new properties like cumulative_hazard_ and survival_function_

hazard_at_times
(times, label=None)
Return a Pandas Series of the predicted hazard at specific times.
Parameters:  times (iterable or float) – values to return the hazard at.
 label (string, optional) – rename the returned Series. Useful for plotting.
Returns: pd.Series

median_
The unique time point, t, such that S(t) = 0.5. This is the "half-life" of the population, and a robust summary statistic for the population, if it exists.

plot
(**kwargs)

plot_cumulative_hazard
(**kwargs)

plot_hazard
(**kwargs)

plot_survival_function
(**kwargs)

predict
(times)
Predict the fitted estimate at certain points in time. Uses linear interpolation if the given times are not in the index.
Parameters: times (a scalar or an array of times to predict the value at.)
Returns: predictions – a scalar if times is a scalar, a numpy array if times is an array.

print_summary
(decimals=2, **kwargs)
Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) – specify the number of decimal places to show
 kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

subtract
(other)
Subtract the fitted estimate of another fitted object of the same class from this one's.
Parameters: other (a fitted instance of the same class)

summary
Summary statistics describing the fit.
Returns: df (pd.DataFrame) – contains columns coef, exp(coef), se(coef), z, p, lower, upper
See also: print_summary

survival_function_at_times
(times, label=None)
Return a Pandas Series of the predicted survival value at specific times.
Parameters:  times (iterable or float) – values to return the survival function at.
 label (string, optional) – rename the returned Series. Useful for plotting.
Returns: pd.Series

lifelines.fitters.log_normal_fitter module

class
lifelines.fitters.log_normal_fitter.
LogNormalFitter
(*args, **kwargs)
Bases: lifelines.fitters.KnownModelParametericUnivariateFitter
This class implements a Log-Normal model for univariate data. The model has parameterized form:
\[S(t) = 1 - \Phi((\log(t) - \mu)/\sigma), \sigma > 0,\]
where \(\Phi\) is the CDF of a standard normal random variable. This implies the cumulative hazard rate is
\[H(t) = -\log(1 - \Phi((\log(t) - \mu)/\sigma)).\]
After calling the .fit method, you have access to properties like survival_function_, mu_ and sigma_. A summary of the fit is available with the method print_summary().
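The two formulas can be sketched directly with the standard library, since \(\Phi\) is expressible through the error function. This is an illustrative sketch of the model's math, not lifelines' fitting code:

```python
import math

def std_normal_cdf(x):
    # Phi(x) via the error function: Phi(x) = (1 + erf(x / sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def log_normal_survival(t, mu, sigma):
    # S(t) = 1 - Phi((log t - mu) / sigma)
    return 1.0 - std_normal_cdf((math.log(t) - mu) / sigma)

def log_normal_cumulative_hazard(t, mu, sigma):
    # H(t) = -log S(t)
    return -math.log(log_normal_survival(t, mu, sigma))

# With mu = 0, sigma = 1 the median of the distribution is t = e^mu = 1,
# so S(1) is exactly 0.5:
print(log_normal_survival(1.0, 0.0, 1.0))  # 0.5
```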

conditional_time_to_event_
Return a DataFrame, with index equal to survival_function_, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual survives until age 1, their expected remaining lifetime given that they lived to time 1 might be 9 years.
Returns: conditional_time_to_ (DataFrame) – with index equal to survival_function_

confidence_interval_
The confidence interval of the cumulative hazard. This is an alias for confidence_interval_cumulative_hazard_.

confidence_interval_cumulative_hazard_
The confidence interval of the cumulative hazard. This is an alias for confidence_interval_.

confidence_interval_hazard_
The confidence interval of the hazard.

confidence_interval_survival_function_
The confidence interval of the survival function.

cumulative_hazard_at_times
(times, label=None)
Return a Pandas Series of the predicted cumulative hazard value at specific times.
Parameters:  times (iterable or float) – values to return the cumulative hazard at.
 label (string, optional) – rename the returned Series. Useful for plotting.
Returns: pd.Series

divide
(other)
Divide the fitted estimate of this object by that of another fitted object of the same class.
Parameters: other (a fitted instance of the same class)

fit
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None)
Parameters:  durations (an array, or pd.Series) – length n, duration each subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None.
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were "born": time zero.
Returns: self – with new properties like cumulative_hazard_ and survival_function_

hazard_at_times
(times, label=None)
Return a Pandas Series of the predicted hazard at specific times.
Parameters:  times (iterable or float) – values to return the hazard at.
 label (string, optional) – rename the returned Series. Useful for plotting.
Returns: pd.Series

median_
The unique time point, t, such that S(t) = 0.5. This is the "half-life" of the population, and a robust summary statistic for the population, if it exists.

plot
(**kwargs)

plot_cumulative_hazard
(**kwargs)

plot_hazard
(**kwargs)

plot_survival_function
(**kwargs)

predict
(times)
Predict the fitted estimate at certain points in time. Uses linear interpolation if the given times are not in the index.
Parameters: times (a scalar or an array of times to predict the value at.)
Returns: predictions – a scalar if times is a scalar, a numpy array if times is an array.

print_summary
(decimals=2, **kwargs)
Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) – specify the number of decimal places to show
 kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

subtract
(other)
Subtract the fitted estimate of another fitted object of the same class from this one's.
Parameters: other (a fitted instance of the same class)

summary
Summary statistics describing the fit.
Returns: df (pd.DataFrame) – contains columns coef, exp(coef), se(coef), z, p, lower, upper
See also: print_summary

survival_function_at_times
(times, label=None)
Return a Pandas Series of the predicted survival value at specific times.
Parameters:  times (iterable or float) – values to return the survival function at.
 label (string, optional) – rename the returned Series. Useful for plotting.
Returns: pd.Series

lifelines.fitters.nelson_aalen_fitter module

class
lifelines.fitters.nelson_aalen_fitter.
NelsonAalenFitter
(alpha=0.95, nelson_aalen_smoothing=True)
Bases: lifelines.fitters.UnivariateFitter
Class for fitting the Nelson-Aalen estimate of the cumulative hazard.
Parameters:  alpha (float, optional) – The alpha value associated with the confidence intervals.
 nelson_aalen_smoothing (bool, optional) – If the event times are naturally discrete (like discrete years, minutes, etc.) then it is advisable to set this parameter to False. See [1], pg. 84.
Notes
[1] Aalen, O., Borgan, O., Gjessing, H., 2008. Survival and Event History Analysis
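The unsmoothed Nelson-Aalen estimate itself is a short sum, sketched below in plain Python for illustration (not lifelines' implementation, which also handles weights, entry times and confidence intervals):

```python
from collections import Counter

def nelson_aalen(durations, event_observed):
    """Nelson-Aalen estimate of the cumulative hazard H(t): at each
    distinct event time t_i, add d_i / n_i, where d_i is the number of
    deaths at t_i and n_i the number at risk just before t_i."""
    deaths = Counter(t for t, e in zip(durations, event_observed) if e)
    removals = Counter(durations)  # deaths and censorings both leave the risk set
    at_risk = len(durations)
    cum_haz, h = {}, 0.0
    for t in sorted(removals):
        h += deaths.get(t, 0) / at_risk
        cum_haz[t] = h
        at_risk -= removals[t]
    return cum_haz

# Three deaths and one censoring (at t=3):
ch = nelson_aalen([1, 2, 3, 4], [1, 1, 0, 1])
print(ch)  # {1: 0.25, 2: 0.5833..., 3: 0.5833..., 4: 1.5833...}
```

Unlike the Kaplan-Meier product, this is an additive estimate, which is why it targets the cumulative hazard rather than the survival function.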

conditional_time_to_event_
Return a DataFrame, with index equal to survival_function_, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual survives until age 1, their expected remaining lifetime given that they lived to time 1 might be 9 years.
Returns: conditional_time_to_ (DataFrame) – with index equal to survival_function_

cumulative_hazard_at_times
(times, label=None)

divide
(other)
Divide the fitted estimate of this object by that of another fitted object of the same class.
Parameters: other (a fitted instance of the same class)

fit
(durations, event_observed=None, timeline=None, entry=None, label='NA_estimate', alpha=None, ci_labels=None, weights=None)
Parameters:  durations (an array, or pd.Series, of length n) – duration each subject was observed for
 timeline (iterable) – return the best estimate at the values in timeline (positively increasing)
 event_observed (an array, or pd.Series, of length n) – True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated observations, i.e. the birth event was not observed. If None, defaults to all 0 (all birth events observed.)
 label (string) – a string to name the column of the estimate.
 alpha (float) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (iterable) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 weights (an array, or pd.Series, of length n) – if providing a weighted dataset. For example, instead of providing every subject as a single element of durations and event_observed, one could weight subjects differently.
Returns: self – with new properties like cumulative_hazard_.

hazard_at_times
(times, label=None)

plot
(**kwargs)

plot_cumulative_hazard
(**kwargs)

plot_hazard
(**kwargs)

plot_survival_function
(**kwargs)

predict
(times)
Predict the fitted estimate at certain points in time. Uses linear interpolation if the given times are not in the index.
Parameters: times (a scalar or an array of times to predict the value at.)
Returns: predictions – a scalar if times is a scalar, a numpy array if times is an array.

smoothed_hazard_
(bandwidth)
Parameters: bandwidth (float) – the bandwidth used in the Epanechnikov kernel.
Returns: a DataFrame of the smoothed hazard

smoothed_hazard_confidence_intervals_
(bandwidth, hazard_=None)
Parameters:  bandwidth (float) – the bandwidth to use in the Epanechnikov kernel. > 0
 hazard_ (numpy array) – a computed (n,) numpy array of estimated hazard rates. If None, uses smoothed_hazard_.

subtract
(other)
Subtract the fitted estimate of another fitted object of the same class from this one's.
Parameters: other (a fitted instance of the same class)

survival_function_at_times
(times, label=None)

lifelines.fitters.piecewise_exponential_fitter module

class
lifelines.fitters.piecewise_exponential_fitter.
PiecewiseExponentialFitter
(breakpoints, *args, **kwargs)
Bases: lifelines.fitters.KnownModelParametericUnivariateFitter
This class implements a Piecewise Exponential model for univariate data. The model has parameterized hazard rate:
\[\begin{split}h(t) = \begin{cases} \lambda_0, & \text{if $t \le \tau_0$} \\ \lambda_1, & \text{if $\tau_0 < t \le \tau_1$} \\ \lambda_2, & \text{if $\tau_1 < t \le \tau_2$} \\ ... \end{cases}\end{split}\]
You specify the breakpoints, \(\tau_i\), and lifelines will find the optimal values for the \(\lambda_i\) parameters.
After calling the .fit method, you have access to properties like survival_function_, plot and cumulative_hazard_. A summary of the fit is available with the method print_summary().
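The piecewise-constant hazard above integrates to a piecewise-linear cumulative hazard. A minimal sketch, with hypothetical breakpoints and rates chosen only for illustration (lifelines estimates the rates for you from data):

```python
from bisect import bisect_left

# Hypothetical breakpoints and per-interval rates, for illustration only.
BREAKPOINTS = [10.0, 20.0]     # tau_0, tau_1
LAMBDAS = [0.05, 0.10, 0.20]   # lambda_0, lambda_1, lambda_2

def hazard(t):
    # Constant hazard within each interval between breakpoints.
    return LAMBDAS[bisect_left(BREAKPOINTS, t)]

def cumulative_hazard(t):
    # H(t) integrates the piecewise-constant hazard: full contributions
    # from completed intervals plus a partial one from the current interval.
    edges = [0.0] + BREAKPOINTS + [float("inf")]
    total = 0.0
    for lam, lo, hi in zip(LAMBDAS, edges, edges[1:]):
        total += lam * max(0.0, min(t, hi) - lo)
        if t <= hi:
            break
    return total

print(cumulative_hazard(25.0))  # 0.05*10 + 0.10*10 + 0.20*5 = 2.5
```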

conditional_time_to_event_
Return a DataFrame, with index equal to survival_function_, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual survives until age 1, their expected remaining lifetime given that they lived to time 1 might be 9 years.
Returns: conditional_time_to_ (DataFrame) – with index equal to survival_function_

confidence_interval_
The confidence interval of the cumulative hazard. This is an alias for confidence_interval_cumulative_hazard_.

confidence_interval_cumulative_hazard_
The confidence interval of the cumulative hazard. This is an alias for confidence_interval_.

confidence_interval_hazard_
The confidence interval of the hazard.

confidence_interval_survival_function_
The confidence interval of the survival function.

cumulative_hazard_at_times
(times, label=None)
Return a Pandas Series of the predicted cumulative hazard value at specific times.
Parameters:  times (iterable or float) – values to return the cumulative hazard at.
 label (string, optional) – rename the returned Series. Useful for plotting.
Returns: pd.Series

divide
(other)
Divide the fitted estimate of this object by that of another fitted object of the same class.
Parameters: other (a fitted instance of the same class)

fit
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None)
Parameters:  durations (an array, or pd.Series) – length n, duration each subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None.
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were "born": time zero.
Returns: self – with new properties like cumulative_hazard_ and survival_function_

hazard_at_times
(times, label=None)
Return a Pandas Series of the predicted hazard at specific times.
Parameters:  times (iterable or float) – values to return the hazard at.
 label (string, optional) – rename the returned Series. Useful for plotting.
Returns: pd.Series

median_
The unique time point, t, such that S(t) = 0.5. This is the "half-life" of the population, and a robust summary statistic for the population, if it exists.

plot
(**kwargs)

plot_cumulative_hazard
(**kwargs)

plot_hazard
(**kwargs)

plot_survival_function
(**kwargs)

predict
(times)
Predict the fitted estimate at certain points in time. Uses linear interpolation if the given times are not in the index.
Parameters: times (a scalar or an array of times to predict the value at.)
Returns: predictions – a scalar if times is a scalar, a numpy array if times is an array.

print_summary
(decimals=2, **kwargs)
Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) – specify the number of decimal places to show
 kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

subtract
(other)
Subtract the fitted estimate of another fitted object of the same class from this one's.
Parameters: other (a fitted instance of the same class)

summary
Summary statistics describing the fit.
Returns: df (pd.DataFrame) – contains columns coef, exp(coef), se(coef), z, p, lower, upper
See also: print_summary

survival_function_at_times
(times, label=None)
Return a Pandas Series of the predicted survival value at specific times.
Parameters:  times (iterable or float) – values to return the survival function at.
 label (string, optional) – rename the returned Series. Useful for plotting.
Returns: pd.Series

lifelines.fitters.weibull_fitter module

class
lifelines.fitters.weibull_fitter.
WeibullFitter
(*args, **kwargs)
Bases: lifelines.fitters.KnownModelParametericUnivariateFitter
This class implements a Weibull model for univariate data. The model has parameterized form:
\[S(t) = \exp(-(\lambda t)^\rho), \lambda > 0, \rho > 0,\]
which implies the cumulative hazard rate is
\[H(t) = (\lambda t)^\rho,\]
and the hazard rate is:
\[h(t) = \rho \lambda (\lambda t)^{\rho - 1}.\]
After calling the .fit method, you have access to properties like cumulative_hazard_, survival_function_, lambda_ and rho_. A summary of the fit is available with the method print_summary().
Examples
>>> from lifelines import WeibullFitter
>>> from lifelines.datasets import load_waltons
>>> waltons = load_waltons()
>>> wbf = WeibullFitter()
>>> wbf.fit(waltons['T'], waltons['E'])
>>> wbf.plot()
>>> print(wbf.lambda_)
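The three Weibull formulas are linked by H(t) = -log S(t) and h(t) = dH/dt, which a short numerical check confirms. A minimal sketch with hypothetical parameter values, for illustration only:

```python
import math

# Hypothetical parameter values, for illustration only.
LAM, RHO = 0.5, 2.0

def survival(t):
    # S(t) = exp(-(lambda * t)^rho)
    return math.exp(-((LAM * t) ** RHO))

def cumulative_hazard(t):
    # H(t) = (lambda * t)^rho
    return (LAM * t) ** RHO

def hazard(t):
    # h(t) = rho * lambda * (lambda * t)^(rho - 1), the derivative of H
    return RHO * LAM * (LAM * t) ** (RHO - 1)

t = 3.0
print(abs(cumulative_hazard(t) + math.log(survival(t))) < 1e-9)  # True
```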

conditional_time_to_event_
Return a DataFrame, with index equal to survival_function_, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual survives until age 1, their expected remaining lifetime given that they lived to time 1 might be 9 years.
Returns: conditional_time_to_ (DataFrame) – with index equal to survival_function_

confidence_interval_
The confidence interval of the cumulative hazard. This is an alias for confidence_interval_cumulative_hazard_.

confidence_interval_cumulative_hazard_
The confidence interval of the cumulative hazard. This is an alias for confidence_interval_.

confidence_interval_hazard_
The confidence interval of the hazard.

confidence_interval_survival_function_
The confidence interval of the survival function.

cumulative_hazard_at_times
(times, label=None)
Return a Pandas Series of the predicted cumulative hazard value at specific times.
Parameters:  times (iterable or float) – values to return the cumulative hazard at.
 label (string, optional) – rename the returned Series. Useful for plotting.
Returns: pd.Series

divide
(other)
Divide the fitted estimate of this object by that of another fitted object of the same class.
Parameters: other (a fitted instance of the same class)

fit
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None)
Parameters:  durations (an array, or pd.Series) – length n, duration each subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None.
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were "born": time zero.
Returns: self – with new properties like cumulative_hazard_ and survival_function_

hazard_at_times
(times, label=None)
Return a Pandas Series of the predicted hazard at specific times.
Parameters:  times (iterable or float) – values to return the hazard at.
 label (string, optional) – rename the returned Series. Useful for plotting.
Returns: pd.Series

median_
The unique time point, t, such that S(t) = 0.5. This is the "half-life" of the population, and a robust summary statistic for the population, if it exists.

plot
(**kwargs)

plot_cumulative_hazard
(**kwargs)

plot_hazard
(**kwargs)

plot_survival_function
(**kwargs)

predict
(times)
Predict the fitted estimate at certain points in time. Uses linear interpolation if the given times are not in the index.
Parameters: times (a scalar or an array of times to predict the value at.)
Returns: predictions – a scalar if times is a scalar, a numpy array if times is an array.

print_summary
(decimals=2, **kwargs)
Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) – specify the number of decimal places to show
 kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

subtract
(other)
Subtract the fitted estimate of another fitted object of the same class from this one's.
Parameters: other (a fitted instance of the same class)

summary
Summary statistics describing the fit.
Returns: df (pd.DataFrame) – contains columns coef, exp(coef), se(coef), z, p, lower, upper
See also: print_summary

survival_function_at_times
(times, label=None)
Return a Pandas Series of the predicted survival value at specific times.
Parameters:  times (iterable or float) – values to return the survival function at.
 label (string, optional) – rename the returned Series. Useful for plotting.
Returns: pd.Series

lifelines.utils

lifelines.utils.
qth_survival_times
(q, survival_functions, cdf=False)
Find the times when one or more survival functions reach the qth percentile.
Parameters:  q (float) – a float between 0 and 1 that represents the level at which the survival function hits the qth percentile.
 survival_functions (a (n,d) DataFrame or numpy array.) – If a DataFrame, will return index values (actual times). If a numpy array, will return indices.
 cdf (boolean, optional) – When working with left-censored data, cdf=True is used.
Returns: if d == 1, returns a float (np.inf if never crossed). If d > 1, a DataFrame containing the first times the value was crossed.
Return type: float, or DataFrame
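The "first crossing" logic behind this utility can be sketched in plain Python, for a single non-increasing survival function (illustrative only, not lifelines' vectorized implementation):

```python
def first_crossing_time(q, times, surv_values):
    """Return the first time the (non-increasing) survival function
    drops to q or below; infinity if it never does."""
    for t, s in zip(times, surv_values):
        if s <= q:
            return t
    return float("inf")

times = [0, 5, 10, 15]
surv = [1.0, 0.7, 0.4, 0.1]
print(first_crossing_time(0.5, times, surv))  # 10, the first time S(t) <= 0.5
```

With q = 0.5 this is exactly the median survival time, which is why median_survival_times below is a close relative of this function.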

lifelines.utils.
qth_survival_time
(q, survival_function, cdf=False)
Returns the time when a single survival function reaches the qth percentile.
Parameters:  q (float) – a float between 0 and 1 that represents the level at which the survival function hits the qth percentile.
 survival_function (Series or single-column DataFrame.)
 cdf (boolean, optional) – When working with left-censored data, cdf=True is used.
Returns: float

lifelines.utils.
median_survival_times
(density_or_survival_function, left_censorship=False)

lifelines.utils.
survival_table_from_events
(death_times, event_observed, birth_times=None, columns=['removed', 'observed', 'censored', 'entrance', 'at_risk'], weights=None, collapse=False, intervals=None)¶ Parameters:  death_times ((n,) array) – represent the event times
 event_observed ((n,) array) – 1 if observed event, 0 is censored event.
 birth_times (a (n,) array, optional) – representing when the subject was first observed. A subject's death event is then at [birth times + duration observed]. If None (default), birth_times are set to be the first observation or 0, whichever is smaller.
 columns (iterable, optional) – the names to use, in order, for the removed individuals, observed deaths, and censorships columns.
 weights ((n,1) array, optional) – Optional argument to use weights for individuals. Assumes weights of 1 if not provided.
 collapse (boolean, optional (default=False)) – If True, collapses the survival table into a life table to show events in interval bins
 intervals (iterable, optional) – Default None, otherwise a list/(n,1) array of interval edge measures. If left as None while collapse=True, then the Freedman-Diaconis rule for histogram bins will be used to determine intervals.
Returns: Pandas DataFrame with index as the unique times or intervals in event_times. The column named 'removed' refers to the number of individuals who were removed from the population by the end of the period. The column 'observed' refers to the number of removed individuals who were observed to have died (i.e. not censored.) The column 'censored' is defined as 'removed' - 'observed' (the number of individuals who left the population without an observed event)
Return type: DataFrame
Example
>>> # Uncollapsed output
>>>           removed  observed  censored  entrance  at_risk
>>> event_at
>>> 0               0         0         0        11       11
>>> 6               1         1         0         0       11
>>> 7               2         2         0         0       10
>>> 9               3         3         0         0        8
>>> 13              3         3         0         0        5
>>> 15              2         2         0         0        2
>>> # Collapsed output
>>>           removed  observed  censored  at_risk
>>>               sum       sum       sum      max
>>> event_at
>>> (0, 2]         34        33         1      312
>>> (2, 4]         84        42        42      278
>>> (4, 6]         64        17        47      194
>>> (6, 8]         63        16        47      130
>>> (8, 10]        35        12        23       67
>>> (10, 12]       24         5        19       32
See also

lifelines.utils.
group_survival_table_from_events
(groups, durations, event_observed, birth_times=None, limit=-1)¶ Joins multiple event series together into dataframes. A generalization of survival_table_from_events to data with groups. Previously called group_event_series pre 0.2.3.
Parameters:  groups (a (n,) array) – individuals' group ids.
 durations (a (n,) array) – durations of each individual
 event_observed (a (n,) array) – event observations, 1 if observed, 0 else.
 birth_times (a (n,) array) – when the subject was first observed. A subject's death event is then at [birth times + duration observed]. Normally set to all zeros, but can be positive or negative.
 limit
Returns:  unique_groups (np.array) – array of all the unique groups present
 removed (DataFrame) – dataframe of removal count data at event_times for each group, column names are 'removed:<group name>'
 observed (DataFrame) – dataframe of observed count data at event_times for each group, column names are 'observed:<group name>'
 censored (DataFrame) – dataframe of censored count data at event_times for each group, column names are 'censored:<group name>'
Example
>>> # input
>>> group_survival_table_from_events(waltonG, waltonT, np.ones_like(waltonT)) # data available in test_suite.py
>>> # output
>>> [
>>>     array(['control', 'miR137'], dtype=object),
>>>               removed:control  removed:miR137
>>>     event_at
>>>     6                       0               1
>>>     7                       2               0
>>>     9                       0               3
>>>     13                      0               3
>>>     15                      0               2
>>>     ,
>>>               observed:control  observed:miR137
>>>     event_at
>>>     6                        0                1
>>>     7                        2                0
>>>     9                        0                3
>>>     13                       0                3
>>>     15                       0                2
>>>     ,
>>>               censored:control  censored:miR137
>>>     event_at
>>>     6                        0                0
>>>     7                        0                0
>>>     9                        0                0
>>>     ,
>>> ]
See also

lifelines.utils.
datetimes_to_durations
(start_times, end_times, fill_date=datetime.datetime(2019, 2, 14, 13, 55, 42, 96344), freq='D', dayfirst=False, na_values=None)¶ This is a very flexible function for transforming arrays of start_times and end_times to the proper format for lifelines: duration and event observation arrays.
Parameters:  start_times (an array, Series or DataFrame) – iterable representing start times. These can be strings, or datetime objects.
 end_times (an array, Series or DataFrame) – iterable representing end times. These can be strings, or datetimes. These values can be None, or an empty string, which corresponds to censorship.
 fill_date (datetime, optional (default=datetime.Today())) – the date to use if end_times is a None or empty string. This corresponds to the last date of observation. Anything after this date is also censored.
 freq (string, optional (default='D')) – the units of time to use. See Pandas 'freq'. Default 'D' for days.
 dayfirst (boolean, optional (default=False)) – convert assuming European-style dates, i.e. day/month/year.
 na_values (list, optional) – list of values to recognize as NA/NaN. Ex: ['', 'NaT']
Returns:  T (numpy array) â€“ array of floats representing the durations with time units given by freq.
 C (numpy array) â€“ boolean array of event observations: 1 if death observed, 0 else.
Examples
>>> from lifelines.utils import datetimes_to_durations
>>>
>>> start_dates = ['2015-01-01', '2015-04-01', '2014-04-05']
>>> end_dates = ['2016-02-02', None, '2014-05-06']
>>>
>>> T, E = datetimes_to_durations(start_dates, end_dates, freq="D")
>>> T # array([ 397., 1414., 31.])
>>> E # array([ True, False, True])

lifelines.utils.
concordance_index
(event_times, predicted_scores, event_observed=None)¶ Calculates the concordance index (C-index) between two series of event times. The first is the real survival times from the experimental data, and the other is the predicted survival times from a model of some kind.
The concordance index is a value between 0 and 1 where 0.5 is the expected result from random predictions, 1.0 is perfect concordance, and 0.0 is perfect anti-concordance (multiply predictions by -1 to get 1.0)
Parameters:  event_times (iterable) – a length-n iterable of observed survival times.
 predicted_scores (iterable) – a length-n iterable of predicted scores – these could be survival times, or hazards, etc. See https://stats.stackexchange.com/questions/352183/use-median-survival-time-to-calculate-cph-c-statistic/352435#352435
 event_observed (iterable, optional) – a length-n iterable of censorship flags, 1 if observed, 0 if not. Default None assumes all observed.
Returns: c-index – a value between 0 and 1.
Return type: float
References
Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine 1996;15(4):361-87.
Examples
>>> from lifelines.utils import concordance_index
>>> cph = CoxPHFitter().fit(df, 'T', 'E')
>>> concordance_index(df['T'], -cph.predict_partial_hazard(df), df['E'])

lifelines.utils.
k_fold_cross_validation
(fitters, df, duration_col, event_col=None, k=5, evaluation_measure=<function concordance_index>, predictor='predict_expectation', predictor_kwargs={}, fitter_kwargs={})¶ Perform cross validation on a dataset. If multiple models are provided, all models will train on each of the k subsets.
Parameters: fitters (model) – one or several objects which possess a method: fit(self, data, duration_col, event_col) Note that the last two arguments will be given as keyword arguments, and that event_col is optional. The objects must also have the "predictor" method defined below.
df (DataFrame) – a Pandas dataframe with necessary columns duration_col and event_col, plus other covariates. duration_col refers to the lifetimes of the subjects. event_col refers to whether the 'death' event was observed: 1 if observed, 0 else (censored).
duration_col ((n,) array) – the column in the dataframe that contains the subjects' lifetimes.
event_col ((n,) array) – the column in the dataframe that contains the subjects' death observations. If left as None, assumes all individuals are non-censored.
k (int) – the number of folds to perform. n/k data will be withheld for testing on.
evaluation_measure (function) – a function that accepts either (event_times, predicted_event_times), or (event_times, predicted_event_times, event_observed) and returns something (could be anything). Default: statistics.concordance_index: (C-index) between two series of event times
predictor (string) – a string that matches a prediction method on the fitter instances. For example, "predict_expectation" or "predict_percentile". Default is "predict_expectation". The interface for the method is:
predict(self, data, **optional_kwargs)
fitter_kwargs – keyword args to pass into the fitter.fit method
predictor_kwargs – keyword args to pass into the predictor method.
Returns: results – (k,1) list of scores for each fold. The scores can be anything.
Return type: list

lifelines.utils.
to_long_format
(df, duration_col)¶ This function converts a survival analysis dataframe to a lifelines "long" format. The lifelines "long" format is commonly used as input to the next function,
add_covariate_to_timeline
.Parameters:  df (DataFrame) – a DataFrame in the standard survival analysis form (one row per observation, with covariates, duration and event flag)
 duration_col (string) – string representing the column in df that represents the durations of each subject.
Returns: long_form_df – A DataFrame with new columns. This can be fed into add_covariate_to_timeline
Return type: DataFrame

lifelines.utils.
to_episodic_format
(df, duration_col, event_col, id_col=None, time_gaps=1)¶ This function takes a "flat" dataset (that is, non-time-varying), and converts it into a time-varying dataset with static variables.
Useful if your dataset has variables that do not satisfy the proportional hazard assumption, and you need to create a time-varying dataset to include interaction terms with time.
Parameters:  df (DataFrame) – a DataFrame of the static dataset.
 duration_col (string) – string representing the column in df that represents the durations of each subject.
 event_col (string) – string representing the column in df that represents whether the subject experienced the event or not.
 id_col (string, optional) – Specify the column that represents an id, else lifelines creates an auto-incrementing one.
 time_gaps (float or int) – Specify a desired time_gap. For example, if time_gap is 2 and a subject lives for 10.5 units of time, then the final long form will have 5 + 1 rows for that subject: (0, 2], (2, 4], (4, 6], (6, 8], (8, 10], (10, 10.5] Smaller time_gaps will produce larger dataframes, and larger time_gaps will produce smaller dataframes. In the limit, the long dataframe will be identical to the original dataframe.
Returns: Return type: DataFrame
Example
>>> from lifelines.datasets import load_rossi
>>> from lifelines.utils import to_episodic_format
>>> rossi = load_rossi()
>>> long_rossi = to_episodic_format(rossi, 'week', 'arrest', time_gaps=2.)
>>>
>>> from lifelines import CoxTimeVaryingFitter
>>> ctv = CoxTimeVaryingFitter()
>>> # the age variable violates the proportional hazard assumption
>>> long_rossi['time * age'] = long_rossi['stop'] * long_rossi['age']
>>> ctv.fit(long_rossi, id_col='id', event_col='arrest', show_progress=True)
>>> ctv.print_summary()
See also

lifelines.utils.
add_covariate_to_timeline
(long_form_df, cv, id_col, duration_col, event_col, add_enum=False, overwrite=True, cumulative_sum=False, cumulative_sum_prefix='cumsum_', delay=0)¶ This is a util function to help create a long form table tracking subjects' covariate changes over time. It is meant to be used iteratively as one adds more and more covariates to track over time. Before using this function, it is recommended to view the documentation at https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html#dataset-creation-for-time-varying-regression.
Parameters:  long_form_df (DataFrame) – a DataFrame that has the initial or intermediate "long" form of time-varying observations. Must contain columns id_col, 'start', 'stop', and event_col. See function to_long_format to transform data into long form.
 cv (DataFrame) – a DataFrame that contains (possibly more than) one covariate to track over time. Must contain columns id_col and duration_col. duration_col represents time since the start of the subject's life.
 id_col (string) – the column in long_form_df and cv representing a unique identifier for subjects.
 duration_col (string) – the column in cv that represents the time-since-birth the observation occurred at.
 event_col (string) – the column in df that represents if the event of interest occurred
 add_enum (boolean, optional) – a Boolean flag to denote whether to add a column enumerating rows per subject. Useful to specify a specific observation, ex: df[df['enum'] == 1] will grab the first observations per subject.
 overwrite (boolean, optional) – if True, covariate values in long_form_df will be overwritten by covariate values in cv if the column exists in both cv and long_form_df and the timestamps are identical. If False, the default behaviour will be to sum the values together.
 cumulative_sum (boolean, optional) – sum over time the new covariates. Makes sense if the covariates are new additions, and not state changes (ex: administering more drugs vs taking a temperature.)
 cumulative_sum_prefix (string, optional) – a prefix to add to calculated cumulative sum columns
 delay (int, optional) – add a delay to covariates (useful for checking for reverse causality in analysis)
Returns: long_form_df – A DataFrame with updated rows to reflect the novel time slices (if any) being added from cv, and novel (or updated) columns of new covariates from cv
Return type: DataFrame

lifelines.utils.
covariates_from_event_matrix
(df, id_col)¶ This is a helper function to handle binary event datastreams in a specific format and convert it to a format that add_covariate_to_timeline will accept. For example, suppose you have a dataset that looks like:
   id  promotion  movement  raise
0   1        1.0       NaN    2.0
1   2        NaN       5.0    NaN
2   3        3.0       5.0    7.0
where the values (aside from the id column) represent when an event occurred for a specific user, relative to the subject's birth/entry. This is a common format when pulling data from a SQL table. We call this a duration matrix, and we want to convert this dataframe to a format that can be included in a long form dataframe (see add_covariate_to_timeline for more details on this).
The duration matrix should have 1 row per subject (but not necessarily all subjects).
Parameters:  df (DataFrame) – the DataFrame we want to transform
 id_col (string) – the column in long_form_df and cv representing a unique identifier for subjects.
Example
>>> cv = covariates_from_event_matrix(duration_df, 'id')
>>> long_form_df = add_covariate_to_timeline(long_form_df, cv, 'id', 'duration', 'e', cumulative_sum=True)
lifelines.statistics¶

class
lifelines.statistics.
StatisticalResult
(p_value, test_statistic, name=None, **kwargs)¶ Bases:
object
This class holds the result of statistical tests with a nice printer wrapper to display the results.
Note
This class' API changed in version 0.16.0.
Parameters:  p_value (iterable or float) – the p-values of a statistical test(s)
 test_statistic (iterable or float) – the test statistics of a statistical test(s). Must be the same size as p-values if iterable.
 name (iterable or string) – if this class holds multiple results (ex: from a pairwise comparison), this can hold the names. Must be the same size as p-values if iterable.
 kwargs – additional information to display in
print_summary()
.

print_summary
(decimals=2, **kwargs)¶ Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) – specify the number of decimal places to show
 kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

summary
¶ returns: a DataFrame containing the test statistics and the p-value :rtype: DataFrame

lifelines.statistics.
logrank_test
(durations_A, durations_B, event_observed_A=None, event_observed_B=None, t_0=-1, **kwargs)¶ Measures and reports on whether two intensity processes are different. That is, given two event series, determines whether the data generating processes are statistically different. The test statistic is chi-squared under the null hypothesis. Let \(h_i(t)\) be the hazard ratio of group \(i\) at time \(t\), then:
\[\begin{split}\begin{align} & H_0: h_1(t) = h_2(t) \\ & H_A: h_1(t) = c h_2(t), \;\; c \ne 1 \end{align}\end{split}\]This implicitly uses the logrank weights.
Note
The logrank test has maximum power when the assumption of proportional hazards is true. As a consequence, if the survival curves cross, the logrank test will give an inaccurate assessment of differences.
Parameters:  durations_A (iterable) – a (n,) list-like of event durations (birth to death, ...) for the first population.
 durations_B (iterable) – a (n,) list-like of event durations (birth to death, ...) for the second population.
 event_observed_A (iterable, optional) – a (n,) list-like of censorship flags, (1 if observed, 0 if not), for the first population. Default assumes all observed.
 event_observed_B (iterable, optional) – a (n,) list-like of censorship flags, (1 if observed, 0 if not), for the second population. Default assumes all observed.
 t_0 (float, optional (default=-1)) – the final time period under observation, -1 for all time.
 kwargs – add keywords and metadata to the experiment summary
Returns: a StatisticalResult object with properties
p_value
,summary
,test_statistic
,print_summary
Return type: Examples
>>> T1 = [1, 4, 10, 12, 12, 3, 5.4]
>>> E1 = [1, 0, 1, 0, 1, 1, 1]
>>>
>>> T2 = [4, 5, 7, 11, 14, 20, 8, 8]
>>> E2 = [1, 1, 1, 1, 1, 1, 1, 1]
>>>
>>> from lifelines.statistics import logrank_test
>>> results = logrank_test(T1, T2, event_observed_A=E1, event_observed_B=E2)
>>>
>>> results.print_summary()
>>> print(results.p_value)        # 0.7676
>>> print(results.test_statistic) # 0.0872
Notes
This is a special case of the function
multivariate_logrank_test
, which is used internally. See Survival and Event Analysis, page 108.

lifelines.statistics.
multivariate_logrank_test
(event_durations, groups, event_observed=None, t_0=-1, **kwargs)¶ This test is a generalization of the logrank_test: it can deal with n>2 populations (and should be equal when n=2):
\[\begin{split}\begin{align} & H_0: h_1(t) = h_2(t) = h_3(t) = ... = h_n(t) \\ & H_A: \text{there exists at least one group that differs from the others.} \end{align}\end{split}\]Parameters:  event_durations (iterable) – a (n,) list-like representing the (possibly partial) durations of all individuals
 groups (iterable) – a (n,) list-like of unique group labels for each individual.
 event_observed (iterable, optional) – a (n,) list-like of event_observed events: 1 if observed death, 0 if censored. Defaults to all observed.
 t_0 (float, optional (default=-1)) – the period under observation, -1 for all time.
 kwargs – add keywords and metadata to the experiment summary.
Returns: a StatisticalResult object with properties
p_value
,summary
,test_statistic
,print_summary
Return type: Examples
>>> df = pd.DataFrame({
>>>    'durations': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
>>>    'events': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
>>>    'groups': [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2]
>>> })
>>> result = multivariate_logrank_test(df['durations'], df['groups'], df['events'])
>>> result.test_statistic
>>> result.p_value
>>> result.print_summary()
>>> # numpy example
>>> G = [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2]
>>> T = [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7]
>>> E = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0]
>>> result = multivariate_logrank_test(T, G, E)
>>> result.test_statistic
See also

lifelines.statistics.
pairwise_logrank_test
(event_durations, groups, event_observed=None, t_0=-1, **kwargs)¶ Perform the logrank test pairwise for all \(n \ge 2\) unique groups.
Parameters:  event_durations (iterable) – a (n,) list-like representing the (possibly partial) durations of all individuals
 groups (iterable) – a (n,) list-like of unique group labels for each individual.
 event_observed (iterable, optional) – a (n,) list-like of event_observed events: 1 if observed death, 0 if censored. Defaults to all observed.
 t_0 (float, optional (default=-1)) – the period under observation, -1 for all time.
 kwargs – add keywords and metadata to the experiment summary.
Returns: a StatisticalResult object that contains all the pairwise comparisons (try
StatisticalResult.summary
orStatisticalResult.print_summary
)Return type: See also

lifelines.statistics.
survival_difference_at_fixed_point_in_time_test
(point_in_time, durations_A, durations_B, event_observed_A=None, event_observed_B=None, **kwargs)¶ Often analysts want to compare the survival of groups at specific times, rather than comparing the entire survival curves against each other. For example, analysts may be interested in 5-year survival. Statistically comparing the naive Kaplan-Meier points at a specific time actually has reduced power (see [1]). By transforming the Kaplan-Meier curve, we can recover more power. This function uses the log(-log) transformation.
Parameters:  point_in_time (float,) – the point in time to analyze the survival curves at.
 durations_A (iterable) – a (n,) list-like of event durations (birth to death, ...) for the first population.
 durations_B (iterable) – a (n,) list-like of event durations (birth to death, ...) for the second population.
 event_observed_A (iterable, optional) – a (n,) list-like of censorship flags, (1 if observed, 0 if not), for the first population. Default assumes all observed.
 event_observed_B (iterable, optional) – a (n,) list-like of censorship flags, (1 if observed, 0 if not), for the second population. Default assumes all observed.
 kwargs – add keywords and metadata to the experiment summary
Returns: a StatisticalResult object with properties
p_value
,summary
,test_statistic
,print_summary
Return type: Examples
>>> T1 = [1, 4, 10, 12, 12, 3, 5.4]
>>> E1 = [1, 0, 1, 0, 1, 1, 1]
>>>
>>> T2 = [4, 5, 7, 11, 14, 20, 8, 8]
>>> E2 = [1, 1, 1, 1, 1, 1, 1, 1]
>>>
>>> from lifelines.statistics import survival_difference_at_fixed_point_in_time_test
>>> results = survival_difference_at_fixed_point_in_time_test(12, T1, T2, event_observed_A=E1, event_observed_B=E2)
>>>
>>> results.print_summary()
>>> print(results.p_value)        # 0.893
>>> print(results.test_statistic) # 0.017
Notes
Other transformations are possible, but Klein et al. [1] showed that the log(-log(c)) transform has the most desirable statistical properties.
References
[1] Klein, J. P., Logan, B., Harhoff, M. and Andersen, P. K. (2007), Analyzing survival curves at a fixed point in time. Statist. Med., 26: 4505-4519. doi:10.1002/sim.2864

lifelines.statistics.
proportional_hazard_test
(fitted_cox_model, training_df, time_transform='rank', precomputed_residuals=None, **kwargs)¶ Test whether any variable in a Cox model breaks the proportional hazard assumption.
Parameters:  fitted_cox_model (CoxPHFitter) – the fitted Cox model, fitted with training_df, you wish to test. Currently only the CoxPHFitter is supported, but later CoxTimeVaryingFitter, too.
 training_df (DataFrame) – the DataFrame used in the call to the Cox model's
fit
.  time_transform (vectorized function, list, or string, optional (default='rank')) – {'all', 'km', 'rank', 'identity', 'log'} One of the strings above, a list of strings, or a function to transform the time (must accept (time, durations, weights) however). 'all' will present all the transforms.
 precomputed_residuals (DataFrame, optional) – specify the residuals, if already computed.
 kwargs – additional parameters to add to the StatisticalResult
Returns: Return type: Notes
R uses the default 'km', we use 'rank', as this performs well versus other transforms. See http://eprints.lse.ac.uk/84988/1/06_ParkHendry2015ReassessingSchoenfeldTests_Final.pdf

lifelines.statistics.
power_under_cph
(n_exp, n_con, p_exp, p_con, postulated_hazard_ratio, alpha=0.05)¶ This computes the power of the hypothesis test that the two groups, experiment and control, have different hazards (that is, the relative hazard ratio is different from 1.)
Parameters:  n_exp (integer) – size of the experiment group.
 n_con (integer) – size of the control group.
 p_exp (float) – probability of failure in experimental group over period of study.
 p_con (float) – probability of failure in control group over period of study
 postulated_hazard_ratio (float) – the postulated hazard ratio
 alpha (float, optional (default=0.05)) – type I error rate
Returns: power to detect the magnitude of the hazard ratio as small as that specified by postulated_hazard_ratio.
Return type: float
Notes
See also

lifelines.statistics.
sample_size_necessary_under_cph
(power, ratio_of_participants, p_exp, p_con, postulated_hazard_ratio, alpha=0.05)¶ This computes the sample size needed to achieve a given power when comparing two groups under a Cox proportional hazards model.
Parameters:  power (float) – power to detect the magnitude of the hazard ratio as small as that specified by postulated_hazard_ratio.
 ratio_of_participants (ratio of participants in experimental group over control group.)
 p_exp (float) – probability of failure in experimental group over period of study.
 p_con (float) – probability of failure in control group over period of study
 postulated_hazard_ratio (float) – the postulated hazard ratio
 alpha (float, optional (default=0.05)) – type I error rate
Returns:  n_exp (integer) – the sample size needed for the experiment group to achieve desired power
 n_con (integer) – the sample size needed for the control group to achieve desired power
Examples
>>> from lifelines.statistics import sample_size_necessary_under_cph
>>>
>>> desired_power = 0.8
>>> ratio_of_participants = 1.
>>> p_exp = 0.25
>>> p_con = 0.35
>>> postulated_hazard_ratio = 0.7
>>> n_exp, n_con = sample_size_necessary_under_cph(desired_power, ratio_of_participants, p_exp, p_con, postulated_hazard_ratio)
>>> # (421, 421)
https://cran.r-project.org/web/packages/powerSurvEpi/powerSurvEpi.pdf
See also
lifelines.plotting¶

lifelines.plotting.
add_at_risk_counts
(*fitters, **kwargs)¶ Add counts showing how many individuals were at risk at each time point in survival/hazard plots.
Parameters: fitters – One or several fitters, for example KaplanMeierFitter, NelsonAalenFitter, etc. Returns: ax Return type: The axes which was used. Examples
>>> # First train some fitters and plot them
>>> fig = plt.figure()
>>> ax = plt.subplot(111)
>>>
>>> f1 = KaplanMeierFitter()
>>> f1.fit(data)
>>> f1.plot(ax=ax)
>>>
>>> f2 = KaplanMeierFitter()
>>> f2.fit(data)
>>> f2.plot(ax=ax)
>>>
>>> # These are equivalent
>>> add_at_risk_counts(f1, f2)
>>> add_at_risk_counts(f1, f2, ax=ax, fig=fig)
>>>
>>> # This overrides the labels
>>> add_at_risk_counts(f1, f2, labels=['fitter one', 'fitter two'])
>>>
>>> # This hides the labels
>>> add_at_risk_counts(f1, f2, labels=None)

lifelines.plotting.
plot_lifetimes
(durations, event_observed=None, entry=None, left_truncated=False, sort_by_duration=True, event_observed_color='#A60628', event_censored_color='#348ABD', **kwargs)¶ Returns a lifetime plot, see examples: https://lifelines.readthedocs.io/en/latest/Survival%20Analysis%20intro.html#Censoring
Parameters:  durations ((n,) numpy array or pd.Series) – duration the subject was observed for.
 event_observed ((n,) numpy array or pd.Series) – array of booleans: True if event observed, else False.
 entry ((n,) numpy array or pd.Series) – offsetting the births away from t=0. This could be from left-truncation, or delayed entry into the study.
 left_truncated (boolean) – if entry is provided, and the data is left-truncated, this will display additional information in the plot to reflect this.
 sort_by_duration (boolean) – sort by the duration vector
 event_observed_color (str) – default: "#A60628"
 event_censored_color (str) – default: "#348ABD"
Returns: Return type: ax
lifelines.datasets¶

lifelines.datasets.
load_canadian_senators
(**kwargs)¶ A history of Canadian senators in office:
Size: (933, 10)
Example:
    Name                                    Abbott, John Joseph Caldwell
    Political Affiliation at Appointment    Liberal-Conservative
    Province / Territory                    Quebec
    Appointed on the advice of              Macdonald, John Alexander
    Term (yyyy.mm.dd)                       1887.05.12 - 1893.10.30 (Death)
    start_date                              1887-05-12 00:00:00
    end_date                                1893-10-30 00:00:00
    reason                                  Death
    diff_days                               2363
    observed                                True

lifelines.datasets.
load_dd
(**kwargs)¶ Classification of political regimes as democracy and dictatorship. Classification of democracies as parliamentary, semi-presidential (mixed) and presidential. Classification of dictatorships as military, civilian and royal. Coverage: 202 countries, from 1946 or year of independence to 2008:
Size: (1808, 12)
Example:
    ctryname             Afghanistan
    cowcode2             700
    politycode           700
    un_region_name       Southern Asia
    un_continent_name    Asia
    ehead                Mohammad Zahir Shah
    leaderspellreg       Mohammad Zahir Shah.Afghanistan.1946.1952.Mona...
    democracy            Non-democracy
    regime               Monarchy
    start_year           1946
    duration             7
    observed             1
References
Cheibub, José Antonio, Jennifer Gandhi, and James Raymond Vreeland. 2010. "Democracy and Dictatorship Revisited." Public Choice, vol. 143, no. 2-1, pp. 67-101.

lifelines.datasets.
load_dfcv
()¶ A toy example of a time-dependent dataset.
Size: (14, 6)
Example:
   start  group  z  stop  id  event
0      0    1.0  0   3.0   1   True
1      0    1.0  0   5.0   2  False
2      0    1.0  1   5.0   3   True
3      0    1.0  0   6.0   4   True
References

lifelines.datasets.
load_g3
(**kwargs)¶ Size: (17, 7)
Example:
    no.          1
    age          41
    sex          Female
    histology    Grade3
    group        RIT
    event        True
    time         53

lifelines.datasets.
load_gbsg2
(**kwargs)¶ A data frame containing the observations from the GBSG2 study of 686 women:
Size: (686, 10)
Example:
    horTh       yes
    age         56
    menostat    Post
    tsize       12
    tgrade      II
    pnodes      7
    progrec     61
    estrec      77
    time        2018
    cens        1
References
 W. Sauerbrei and P. Royston (1999). Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. Journal of the Royal Statistical Society Series A, Volume 162(1), 71–94
 M. Schumacher, G. Bastert, H. Bojar, K. Huebner, M. Olschewski, W. Sauerbrei, C. Schmoor, C. Beyerle, R.L.A. Neumann and H.F. Rauschecker for the German Breast Cancer Study Group (1994), Randomized 2 × 2 trial evaluating hormonal treatment and the duration of chemotherapy in node positive breast cancer patients. Journal of Clinical Oncology, 12, 2086–2093

lifelines.datasets.
load_holly_molly_polly
(**kwargs)¶ From https://stat.ethz.ch/education/semesters/ss2011/seminar/contents/presentation_10.pdf Used as a toy example for CoxPH in recurrent SA:
   ID  Status  Stratum  Start(days)  Stop(days)  tx    T
0   M       1        1            0         100   1  100
1   M       1        2          100         105   1    5
2   H       1        1            0          30   0   30
3   H       1        2           30          50   0   20
4   P       1        1            0          20   0   20

lifelines.datasets.
load_kidney_transplant
(**kwargs)¶ Size: (863, 6)
Example:
    time            5
    death           0
    age             51
    black_male      0
    white_male      1
    black_female    0

lifelines.datasets.
load_larynx
(**kwargs)¶ Size: (89, 6)
Example:
   time  age  death  Stage II  Stage III  Stage IV
0   0.6   77      1         0          0         0
1   1.3   53      1         0          0         0
2   2.4   45      1         0          0         0
3   2.5   57      0         0          0         0
4   3.2   58      1         0          0         0

lifelines.datasets.
load_lcd
(**kwargs)¶ Size: (104, 3)
Example:
   C  T         group
0  0  1  alluvial_fan
1  0  1  alluvial_fan
2  0  1  alluvial_fan
3  0  1  alluvial_fan
4  1  1  alluvial_fan

lifelines.datasets.
load_leukemia
(**kwargs)¶ Leukemia dataset:
Size: (42, 5)
Example:
    t  status  sex  logWBC  Rx
0  35       0    1    1.45   0
1  34       0    1    1.47   0
2  32       0    1    2.20   0
3  32       0    1    2.53   0
4  25       0    1    1.78   0
References
From http://web1.sph.emory.edu/dkleinb/allDatasets/surv2datasets/anderson.dat

lifelines.datasets.
load_lung
(**kwargs)¶ Survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities:
Size: (288, 10)
Example:
   inst  time  status  age  sex  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
0   3.0   306       2   74    1      1.0      90.0      100.0    1175.0      NaN
1   3.0   455       2   68    1      0.0      90.0       90.0    1225.0     15.0
2   3.0  1010       1   56    1      0.0      90.0       90.0       NaN     15.0
3   5.0   210       2   57    1      1.0      90.0       60.0    1150.0     11.0
4   1.0   883       2   60    1      0.0     100.0       90.0       NaN      0.0
References
Loprinzi CL. Laurie JA. Wieand HS. Krook JE. Novotny PJ. Kugler JW. Bartel J. Law M. Bateman M. Klatt NE. et al. Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 12(3):601-7, 1994.

lifelines.datasets.
load_lymphoma
(**kwargs)Â¶ Size: (80, 3) Example: Stage_group Time Censor 0 1 6 1 1 1 19 1 2 1 32 1 3 1 42 1 4 1 42 1
References
From https://www.statsdirect.com/help/content/survival_analysis/logrank.htm

lifelines.datasets.
load_multicenter_aids_cohort_study
(**kwargs)Â¶ Originally in [1]:
Siz: (78, 4) AIDSY: date of AIDS diagnosis W: years from AIDS diagnosis to study entry T: years from AIDS diagnosis to minimum of death or censoring D: indicator of death during follow up i AIDSY W T D 1 1990.425 4.575 7.575 0 2 1991.250 3.750 6.750 0 3 1992.014 2.986 5.986 0 4 1992.030 2.970 5.970 0 5 1992.072 2.928 5.928 0 6 1992.220 2.780 4.688 1
References
[1] Cole SR, Hudgens MG. Survival analysis in infectious disease research: describing events in time. AIDS. 2010;24(16):242331.

lifelines.datasets.
load_panel_test
(**kwargs)Â¶ Size: (28,5) Example: id t E var1 var2 0 1 1 0 0.0 1 1 1 2 0 0.0 1 2 1 3 0 4.0 3 3 1 4 1 8.0 4 4 2 1 0 1.2 1

lifelines.datasets.
load_psychiatric_patients
(**kwargs)Â¶ Size: (26,4) Example: Age T C sex 0 51 1 1 2 1 58 1 1 2 2 55 2 1 2 3 28 22 1 2 4 21 30 0 1

lifelines.datasets.
load_recur
(**kwargs)Â¶ From ftp://ftp.wiley.com/public/sci_tech_med/survival/, first published in â€śApplied Survival Analysis: Regression Modeling of Time to Event Data, Second Editionâ€ť:
ID Subject Identification 1  400 AGE Age years TREAT Treatment Assignment 0 = New 1 = Old TIME0 Day of Previous Episode Days TIME1 Day of New Episode Days or censoring CENSOR Indicator for Soreness 1 = Episode Occurred Episode or Censoring at TIME1 0 = Censored EVENT Soreness Episode Number 0 to at most 4 Size: (1296, 7) Example: ID,AGE,TREAT,TIME0,TIME1,CENSOR,EVENT 1,43,0,9,56,1,3 1,43,0,56,88,1,4 1,43,0,0,6,1,1 1,43,0,6,9,1,2

lifelines.datasets.
load_regression_dataset
(**kwargs)Â¶ Artificial regression dataset. Useful since there are no ties in this dataset. Slightly edit in v0.15.0 to achieve this, however.:
Size: (200,5) Example: var1 var2 var3 T E 0 0.595170 1.143472 1.571079 14.785479 1 1 0.209325 0.184677 0.356980 7.336734 1 2 0.693919 0.071893 0.557960 5.271527 1 3 0.443804 1.364646 0.374221 11.684168 1 4 1.613324 0.125566 1.921325 7.637764 1

lifelines.datasets.
load_rossi
(**kwargs)Â¶ This data set is originally from Rossi et al. (1980), and is used as an example in Allison (1995). The data pertain to 432 convicts who were released from Maryland state prisons in the 1970s and who were followed up for one year after release. Half the released convicts were assigned at random to an experimental treatment in which they were given financial aid; half did not receive aid.:
Size: (432,9) Example: week 20 arrest 1 fin 0 age 27 race 1 wexp 0 mar 0 paro 1 prio 3
References
Rossi, P.H., R.A. Berk, and K.J. Lenihan (1980). Money, Work, and Crime: Some Experimental Results. New York: Academic Press. John Fox, Marilia Sa Carvalho (2012). The RcmdrPlugin.survival Package: Extending the R Commander Interface to Survival Analysis. Journal of Statistical Software, 49(7), 132.

lifelines.datasets.
load_stanford_heart_transplants
(**kwargs)Â¶ This is a classic dataset for survival regression with time varying covariates. The original dataset is from [1], and this dataset is from Râ€™s survival library.:
Size: (172, 8) Example: start stop event age year surgery transplant id 0 0.0 50.0 1 17.155373 0.123203 0 0 1 1 0.0 6.0 1 3.835729 0.254620 0 0 2 2 0.0 1.0 0 6.297057 0.265572 0 0 3 3 1.0 16.0 1 6.297057 0.265572 0 1 3 4 0.0 36.0 0 7.737166 0.490075 0 0 4
References
 [1] J Crowley and M Hu. Covariance analysis of heart transplant survival data. J American
 Statistical Assoc, 72:27â€“36, 1977.

lifelines.datasets.
load_static_test
(**kwargs)Â¶ Size: (7,5) Example: id t E var1 var2 0 1 4 1 1 1 1 2 3 1 2 2 2 3 3 0 3 3 3 4 4 1 4 4 4 5 2 1 5 5 5 6 0 1 6 6 6 7 2 1 7 7

lifelines.datasets.
load_waltons
(**kwargs)Â¶ Genotypes and number of days survived in Drosophila. Since we work with flies, we donâ€™t need to worry about leftcensoring. We know the birth date of all flies. We do have issues with accidentally killing some or if some escape. These would be rightcensored as we do not actually observe their death due to â€śnaturalâ€ť causes.:
Size: (163,3) Example: T E group 0 6 1 miR137 1 13 1 miR137 2 13 1 miR137 3 13 1 miR137 4 19 1 miR137

class
lifelines.
KaplanMeierFitter
(alpha=0.95)Â¶ Bases:
lifelines.fitters.UnivariateFitter
Class for fitting the KaplanMeier estimate for the survival function.
Parameters: alpha (float, option (default=0.95)) â€“ The alpha value associated with the confidence intervals. Examples
>>> from lifelines import KaplanMeierFitter >>> from lifelines.datasets import load_waltons >>> waltons = load_waltons() >>> kmf = KaplanMeierFitter() >>> kmf.fit(waltons['T'], waltons['E']) >>> kmf.plot()

conditional_time_to_event_
Â¶ Return a DataFrame, with index equal to
survival_function_
, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual exists until age 1, their expected life remaining given they lived to time 1 might be 9 years.Returns: conditional_time_to_ â€“ with index equal to survival_function_
Return type: DataFrame

cumulative_hazard_at_times
(times, label=None)Â¶

divide
(other)Â¶ Divide the {0} of two {1} objects.
Parameters: other (an {1} fitted instance.)

fit
(durations, event_observed=None, timeline=None, entry=None, label='KM_estimate', alpha=None, left_censorship=False, ci_labels=None, weights=None)Â¶ Parameters:  durations (an array, or pd.Series, of length n â€“ duration subject was observed for)
 timeline (return the best estimate at the values in timelines (postively increasing))
 event_observed (an array, or pd.Series, of length n â€“ True if the the death was observed, False if the event) â€“ was lost (rightcensored). Defaults all True if event_observed==None
 entry (an array, or pd.Series, of length n â€“ relative time when a subject entered the study. This is) â€“ useful for lefttruncated (not leftcensored) observations. If None, all members of the population entered study when they were â€śbornâ€ť.
 label (a string to name the column of the estimate.)
 alpha (the alpha value in the confidence intervals. Overrides the initializing) â€“ alpha for this call to fit only.
 left_censorship (True if durations and event_observed refer to left censorship events. Default False)
 ci_labels (add custom column names to the generated confidence intervals) â€“ as a length2 list: [<lowerbound name>, <upperbound name>]. Default: <label>_lower_<alpha>
 weights (n array, or pd.Series, of length n, if providing a weighted dataset. For example, instead) â€“ of providing every subject as a single element of durations and event_observed, one could weigh subject differently.
Returns: self â€“ self with new properties like
survival_function_
.Return type:

hazard_at_times
(times, label=None)Â¶

plot
(**kwargs)Â¶

plot_cumulative_hazard
(**kwargs)Â¶

plot_hazard
(**kwargs)Â¶

plot_loglogs
(*args, **kwargs)Â¶

plot_survival_function
(**kwargs)Â¶

predict
(times)Â¶ Predict the {0} at certain point in time. Uses a linear interpolation if points in time are not in the index.
Parameters: times (a scalar or an array of times to predict the value of {0} at.) Returns: predictions Return type: a scalar if time is a scalar, a numpy array if time in an array.

subtract
(other)Â¶ Subtract the {0} of two {1} objects.
Parameters: other (an {1} fitted instance.)

survival_function_at_times
(times, label=None)Â¶ Return a Pandas series of the predicted survival value at specific times
Parameters: times (iterable or float) Returns: Return type: pd.Series


class
lifelines.
NelsonAalenFitter
(alpha=0.95, nelson_aalen_smoothing=True)Â¶ Bases:
lifelines.fitters.UnivariateFitter
Class for fitting the NelsonAalen estimate for the cumulative hazard.
NelsonAalenFitter(alpha=0.95, nelson_aalen_smoothing=True)
Parameters:  alpha (float, optional) â€“ The alpha value associated with the confidence intervals.
 nelson_aalen_smoothing (bool, optional) â€“ If the event times are naturally discrete (like discrete years, minutes, etc.) then it is advisable to turn this parameter to False. See [1], pg.84.
Notes
[1] Aalen, O., Borgan, O., Gjessing, H., 2008. Survival and Event History Analysis

conditional_time_to_event_
Â¶ Return a DataFrame, with index equal to
survival_function_
, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual exists until age 1, their expected life remaining given they lived to time 1 might be 9 years.Returns: conditional_time_to_ â€“ with index equal to survival_function_
Return type: DataFrame

cumulative_hazard_at_times
(times, label=None)Â¶

divide
(other)Â¶ Divide the {0} of two {1} objects.
Parameters: other (an {1} fitted instance.)

fit
(durations, event_observed=None, timeline=None, entry=None, label='NA_estimate', alpha=None, ci_labels=None, weights=None)Â¶ Parameters:  durations (an array, or pd.Series, of length n) â€“ duration subject was observed for
 timeline (iterable) â€“ return the best estimate at the values in timelines (postively increasing)
 event_observed (an array, or pd.Series, of length n) â€“ True if the the death was observed, False if the event was lost (rightcensored). Defaults all True if event_observed==None
 entry (an array, or pd.Series, of length n) â€“ relative time when a subject entered the study. This is useful for lefttruncated observations, i.e the birth event was not observed. If None, defaults to all 0 (all birth events observed.)
 label (string) â€“ a string to name the column of the estimate.
 alpha (float) â€“ the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (iterable) â€“ add custom column names to the generated confidence intervals as a length2 list: [<lowerbound name>, <upperbound name>]. Default: <label>_lower_<alpha>
 weights (n array, or pd.Series, of length n) â€“ if providing a weighted dataset. For example, instead of providing every subject as a single element of durations and event_observed, one could weigh subject differently.
Returns: Return type: self, with new properties like
cumulative_hazard_
.

hazard_at_times
(times, label=None)Â¶

plot
(**kwargs)Â¶

plot_cumulative_hazard
(**kwargs)Â¶

plot_hazard
(**kwargs)Â¶

plot_survival_function
(**kwargs)Â¶

predict
(times)Â¶ Predict the {0} at certain point in time. Uses a linear interpolation if points in time are not in the index.
Parameters: times (a scalar or an array of times to predict the value of {0} at.) Returns: predictions Return type: a scalar if time is a scalar, a numpy array if time in an array.

smoothed_hazard_
(bandwidth)Â¶ Parameters: bandwidth (float) â€“ the bandwith used in the Epanechnikov kernel. Returns: a DataFrame of the smoothed hazard Return type: DataFrame

smoothed_hazard_confidence_intervals_
(bandwidth, hazard_=None)Â¶ Parameters:  bandwidth (float) â€“ the bandwith to use in the Epanechnikov kernel. > 0
 hazard_ (numpy array) â€“ a computed (n,) numpy array of estimated hazard rates. If none, uses
smoothed_hazard_

subtract
(other)Â¶ Subtract the {0} of two {1} objects.
Parameters: other (an {1} fitted instance.)

survival_function_at_times
(times, label=None)Â¶

class
lifelines.
AalenAdditiveFitter
(fit_intercept=True, alpha=0.95, coef_penalizer=0.0, smoothing_penalizer=0.0)Â¶ Bases:
lifelines.fitters.BaseFitter
This class fits the regression model:
\[h(tx) = b_0(t) + b_1(t) x_1 + ... + b_N(t) x_N\]that is, the hazard rate is a linear function of the covariates with timevarying coefficients. This implementation assumes nontimevarying covariates, see
TODO: name
Note
This class was rewritten in lifelines 0.17.0 to focus solely on static datasets. There is no guarantee of backwards compatibility.
Parameters:  fit_intercept (bool, optional (default: True)) â€“ If False, do not attach an intercept (column of ones) to the covariate matrix. The intercept, \(b_0(t)\) acts as a baseline hazard.
 alpha (float) â€“ the level in the confidence intervals.
 coef_penalizer (float, optional (default: 0)) â€“ Attach a L2 penalizer to the size of the coeffcients during regression. This improves stability of the estimates and controls for high correlation between covariates. For example, this shrinks the absolute value of \(c_{i,t}\).
 smoothing_penalizer (float, optional (default: 0)) â€“ Attach a L2 penalizer to difference between adjacent (over time) coefficents. For example, this shrinks the absolute value of \(c_{i,t}  c_{i,t+1}\).

fit
(df, duration_col, event_col=None, weights_col=None, show_progress=False)Â¶ Parameters: Fit the Aalen Additive model to a dataset.
Parameters:  df (DataFrame) â€“ a Pandas dataframe with necessary columns duration_col and event_col (see below), covariates columns, and special columns (weights). duration_col refers to the lifetimes of the subjects. event_col refers to whether the â€deathâ€™ events was observed: 1 if observed, 0 else (censored).
 duration_col (string) â€“ the name of the column in dataframe that contains the subjectsâ€™ lifetimes.
 event_col (string, optional) â€“ the name of thecolumn in dataframe that contains the subjectsâ€™ death observation. If left as None, assume all individuals are uncensored.
 weights_col (string, optional) â€“ an optional column in the dataframe, df, that denotes the weight per subject. This column is expelled and not used as a covariate, but as a weight in the final regression. Default weight is 1. This can be used for caseweights. For example, a weight of 2 means there were two subjects with identical observations. This can be used for sampling weights.
 show_progress (boolean, optional (default=False)) â€“ Since the fitter is iterative, show iteration number.
Returns: self â€“ self with additional new properties:
cumulative_hazards_
, etc.Return type: Examples
>>> from lifelines import AalenAdditiveFitter >>> >>> df = pd.DataFrame({ >>> 'T': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7], >>> 'E': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0], >>> 'var': [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2], >>> 'age': [4, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7], >>> }) >>> >>> aaf = AalenAdditiveFitter() >>> aaf.fit(df, 'T', 'E') >>> aaf.predict_median(df) >>> aaf.print_summary()

plot
(columns=None, loc=None, iloc=None, **kwargs)Â¶ â€ť A wrapper around plotting. Matplotlib plot arguments can be passed in, plus:
Parameters: columns (string or listlike, optional) â€“ If not empty, plot a subset of columns from the
cumulative_hazards_
. Default all.loc
iloc (slice, optional) â€“
 specify a locationbased subsection of the curves to plot, ex:
.plot(iloc=slice(0,10))
will plot the first 10 time points.

predict_cumulative_hazard
(X)Â¶ Returns the hazard rates for the individuals
Parameters: X (a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns) â€“ can be in any order. If a numpy array, columns must be in the same order as the training data.

predict_expectation
(X)Â¶ Compute the expected lifetime, E[T], using covariates X.
Parameters:  X (a (n,d) covariate numpy array or DataFrame) â€“ If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
 Returns the expected lifetimes for the individuals

predict_median
(X)Â¶ Parameters:  X (a (n,d) covariate numpy array or DataFrame) â€“ If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
 Returns the median lifetimes for the individuals

predict_percentile
(X, p=0.5)Â¶ Returns the median lifetimes for the individuals. http://stats.stackexchange.com/questions/102986/percentilelossfunctions
Parameters:  X (a (n,d) covariate numpy array or DataFrame) â€“ If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
 p (float) â€“ default: 0.5

predict_survival_function
(X)Â¶ Returns the survival functions for the individuals
Parameters: X (a (n,d) covariate numpy array or DataFrame) â€“ If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.

print_summary
(decimals=2, **kwargs)Â¶ Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) â€“ specify the number of decimal places to show
 kwargs â€“ print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

score_
Â¶ The concordance score (also known as the cindex) of the fit. The cindex is a generalization of the AUC to survival data, including censorships.
For this purpose, the
score_
is a measure of the predictive accuracy of the fitted model onto the training dataset. Itâ€™s analgous to the R^2 in linear models.

smoothed_hazards_
(bandwidth=1)Â¶ Using the epanechnikov kernel to smooth the hazard function, with sigma/bandwidth

summary
Â¶ Summary statistics describing the fit. Set alpha property in the object before calling.
Returns: df Return type: DataFrame

class
lifelines.
BreslowFlemingHarringtonFitter
(alpha=0.95)Â¶ Bases:
lifelines.fitters.UnivariateFitter
Class for fitting the BreslowFlemingHarrington estimate for the survival function. This estimator is a biased estimator of the survival function but is more stable when the popualtion is small and there are too few early truncation times, it may happen that is the number of patients at risk and the number of deaths is the same.
Mathematically, the NAF estimator is the negative logarithm of the BFH estimator.
BreslowFlemingHarringtonFitter(alpha=0.95)
Parameters: alpha (float) â€“ The alpha value associated with the confidence intervals. 
conditional_time_to_event_
Â¶ Return a DataFrame, with index equal to
survival_function_
, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual exists until age 1, their expected life remaining given they lived to time 1 might be 9 years.Returns: conditional_time_to_ â€“ with index equal to survival_function_
Return type: DataFrame

cumulative_hazard_at_times
(times, label=None)Â¶

divide
(other)Â¶ Divide the {0} of two {1} objects.
Parameters: other (an {1} fitted instance.)

fit
(durations, event_observed=None, timeline=None, entry=None, label='BFH_estimate', alpha=None, ci_labels=None)Â¶ Parameters:  durations (an array, or pd.Series, of length n) â€“ duration subject was observed for
 timeline â€“ return the best estimate at the values in timelines (postively increasing)
 event_observed (an array, or pd.Series, of length n) â€“ True if the the death was observed, False if the event was lost (rightcensored). Defaults all True if event_observed==None
 entry (an array, or pd.Series, of length n) â€“ relative time when a subject entered the study. This is useful for lefttruncated observations, i.e the birth event was not observed. If None, defaults to all 0 (all birth events observed.)
 label (string) â€“ a string to name the column of the estimate.
 alpha (float) â€“ the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (iterable) â€“ add custom column names to the generated confidence intervals as a length2 list: [<lowerbound name>, <upperbound name>]. Default: <label>_lower_<alpha>
Returns: Return type: self, with new properties like
survival_function_
.

hazard_at_times
(times, label=None)Â¶

plot
(**kwargs)Â¶

plot_cumulative_hazard
(**kwargs)Â¶

plot_hazard
(**kwargs)Â¶

plot_survival_function
(**kwargs)Â¶

predict
(times)Â¶ Predict the {0} at certain point in time. Uses a linear interpolation if points in time are not in the index.
Parameters: times (a scalar or an array of times to predict the value of {0} at.) Returns: predictions Return type: a scalar if time is a scalar, a numpy array if time in an array.

subtract
(other)Â¶ Subtract the {0} of two {1} objects.
Parameters: other (an {1} fitted instance.)

survival_function_at_times
(times, label=None)Â¶ Return a Pandas series of the predicted survival value at specific times
Parameters: times (iterable or float) Returns: Return type: pd.Series


class
lifelines.
CoxPHFitter
(alpha=0.95, tie_method='Efron', penalizer=0.0, strata=None)Â¶ Bases:
lifelines.fitters.BaseFitter
This class implements fitting Coxâ€™s proportional hazard model:
\[h(tx) = h_0(t) \exp(x \beta)\]Parameters: alpha (float, optional (default=0.95)) â€“ the level in the confidence intervals.
tie_method (string, optional) â€“ specify how the fitter should deal with ties. Currently only â€Efronâ€™ is available.
penalizer (float, optional (default=0.0)) â€“ Attach a L2 penalizer to the size of the coeffcients during regression. This improves stability of the estimates and controls for high correlation between covariates. For example, this shrinks the absolute value of \(\beta_i\). Recommended, even if a small value. The penalty is \(1/2 \text{penalizer} beta^2\).
strata (list, optional) â€“
 specify a list of columns to use in stratification. This is useful if a
catagorical covariate does not obey the proportional hazard assumption. This is used similar to the strata expression in R. See http://courses.washington.edu/b515/l17.pdf.
Examples
>>> from lifelines.datasets import load_rossi >>> from lifelines import CoxPHFitter >>> rossi = load_rossi() >>> cph = CoxPHFitter() >>> cph.fit(rossi, 'week', 'arrest') >>> cph.print_summary()

check_assumptions
(training_df, advice=True, show_plots=False, p_value_threshold=0.05, plot_n_bootstraps=10)Â¶ Use this function to test the proportional hazards assumption. See iterative usage example at https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html
Parameters:  training_df (DataFrame) â€“ the original DataFrame used in the call to
fit(...)
 advice (boolean, optional) – display advice as output to the user's screen
 show_plots (boolean, optional) â€“ display plots of the scaled schoenfeld residuals and loess curves. This is an eyeball test for violations. This will slow down the function significantly.
 p_value_threshold (float, optional) â€“ the threshold to use to alert the user of violations. See note below.
 plot_n_bootstraps â€“ in the plots displayed, also display plot_n_bootstraps bootstrapped loess curves. This will slow down the function significantly.
Examples
>>> from lifelines.datasets import load_rossi >>> from lifelines import CoxPHFitter >>> >>> rossi = load_rossi() >>> cph = CoxPHFitter().fit(rossi, 'week', 'arrest') >>> >>> cph.check_assumptions(rossi)
Notes
The
p_value_threshold
is arbitrarily set at 0.05. Under the null, some covariates will be below the threshold (i.e. by chance). This is compounded when there are many covariates. Similarly, when there are many observations, even minor deviations from the proportional hazard assumption will be flagged.
With that in mind, itâ€™s best to use a combination of statistical tests and eyeball tests to determine the most serious violations.
References
section 5 in https://socialsciences.mcmaster.ca/jfox/Books/Companion/appendices/AppendixCoxRegression.pdf http://www.mwsug.org/proceedings/2006/stats/MWSUG2006SD08.pdf http://eprints.lse.ac.uk/84988/1/06_ParkHendry2015ReassessingSchoenfeldTests_Final.pdf

compute_residuals
(training_dataframe, kind)¶ Parameters:
- training_dataframe (pandas DataFrame) – the same training dataframe given in fit
- kind (string) – one of {'schoenfeld', 'score', 'delta_beta', 'deviance', 'martingale', 'scaled_schoenfeld'}

fit
(df, duration_col=None, event_col=None, show_progress=False, initial_beta=None, strata=None, step_size=None, weights_col=None, cluster_col=None, robust=False, batch_mode=None)¶ Fit the Cox Proportional Hazard model to a dataset.
Parameters:
- df (DataFrame) – a Pandas dataframe with necessary columns duration_col and event_col (see below), covariates columns, and special columns (weights, strata). duration_col refers to the lifetimes of the subjects. event_col refers to whether the 'death' event was observed: 1 if observed, 0 else (censored).
- duration_col (string) – the name of the column in dataframe that contains the subjects' lifetimes.
- event_col (string, optional) – the name of the column in dataframe that contains the subjects' death observation. If left as None, assume all individuals are uncensored.
- weights_col (string, optional) – an optional column in the dataframe, df, that denotes the weight per subject. This column is expelled and not used as a covariate, but as a weight in the final regression. Default weight is 1. This can be used for case weights: for example, a weight of 2 means there were two subjects with identical observations. This can also be used for sampling weights. In that case, use robust=True to get more accurate standard errors.
- show_progress (boolean, optional (default=False)) – since the fitter is iterative, show convergence diagnostics. Useful if convergence is failing.
- initial_beta (numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
- strata (list or string, optional) – specify a column or list of columns to use in stratification. This is useful if a categorical covariate does not obey the proportional hazard assumption. This is used similarly to the strata expression in R. See http://courses.washington.edu/b515/l17.pdf.
- step_size (float, optional) – set an initial step size for the fitting algorithm.
- robust (boolean, optional (default=False)) – Compute the robust errors using the Huber sandwich estimator, aka the Wei-Lin estimate. This does not handle ties, so if there is a high number of ties, results may significantly differ. See "The Robust Inference for the Cox Proportional Hazards Model", Journal of the American Statistical Association, Vol. 84, No. 408 (Dec., 1989), pp. 1074-1078.
- cluster_col (string, optional) – specifies what column has unique identifiers for clustering covariances. Using this forces the sandwich estimator (robust variance estimator) to be used.
- batch_mode (bool, optional) – enabling batch_mode can be faster for datasets with a large number of ties. If left as None, lifelines will choose the best option.
Returns: self – self with additional new properties:
print_summary
,hazards_
,confidence_intervals_
,baseline_survival_
, etc. Return type: CoxPHFitter
Note
Tied survival times are handled using Efron's tie-method.
Examples
>>> from lifelines import CoxPHFitter >>> >>> df = pd.DataFrame({ >>> 'T': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7], >>> 'E': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0], >>> 'var': [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2], >>> 'age': [4, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7], >>> }) >>> >>> cph = CoxPHFitter() >>> cph.fit(df, 'T', 'E') >>> cph.print_summary() >>> cph.predict_median(df)
>>> from lifelines import CoxPHFitter >>> >>> df = pd.DataFrame({ >>> 'T': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7], >>> 'E': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0], >>> 'var': [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2], >>> 'weights': [1.1, 0.5, 2.0, 1.6, 1.2, 4.3, 1.4, 4.5, 3.0, 3.2, 0.4, 6.2], >>> 'month': [10, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7], >>> 'age': [4, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7], >>> }) >>> >>> cph = CoxPHFitter() >>> cph.fit(df, 'T', 'E', strata=['month', 'age'], robust=True, weights_col='weights') >>> cph.print_summary() >>> cph.predict_median(df)

plot
(columns=None, **errorbar_kwargs)Â¶ Produces a visual representation of the coefficients, including their standard errors and magnitudes.
Parameters:  columns (list, optional) â€“ specify a subset of the columns to plot
 errorbar_kwargs â€“ pass in additional plotting commands to matplotlib errorbar command
Returns: ax – the matplotlib axis that can be edited.
Return type: matplotlib axis

plot_covariate_groups
(covariate, values, **kwargs)¶ Produces a visual representation comparing the baseline survival curve of the model versus what happens when a covariate is varied over values in a group. This is useful to compare subjects' survival as we vary a single covariate, all else being held equal. The baseline survival curve is equal to the predicted survival curve at all average values in the original dataset.
Parameters:  covariate (string) â€“ a string of the covariate in the original dataset that we wish to vary.
 values (iterable) â€“ an iterable of the values we wish the covariate to take on.
 kwargs â€“ pass in additional plotting commands
Returns: ax – the matplotlib axis that can be edited.
Return type: matplotlib axis, or list of axes

predict_cumulative_hazard
(X, times=None)Â¶ Parameters:  X (numpy array or DataFrame) â€“ a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
 times (iterable, optional) â€“ an iterable of increasing times to predict the cumulative hazard at. Default is the set of all durations (observed and unobserved). Uses a linear interpolation if points in time are not in the index.
Returns: cumulative_hazard_ â€“ the cumulative hazard of individuals over the timeline
Return type: DataFrame

predict_expectation
(X)¶ Compute the expected lifetime, \(E[T]\), using covariates X. The expectation is computed using the fact that \(E[T] = \int_0^\infty P(T > t) dt = \int_0^\infty S(t) dt\). To compute the integral, we use the trapezoidal rule to approximate it.
However, if the survival function doesn't converge to 0, the expectation is really infinity and the returned values are meaningless/too large. In that case, using
predict_median
or
predict_percentile
would be better. Parameters: X (numpy array or DataFrame) – a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data. Returns: expectations Return type: DataFrame Notes
If X is a dataframe, the order of the columns do not matter. But if X is an array, then the column ordering is assumed to be the same as the training dataset.
See also
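The trapezoidal approximation of \(\int_0^\infty S(t) dt\) described above takes only a few lines to reproduce (toy survival curve, plain Python):

```python
# Toy survival curve on a timeline, standing in for predict_survival_function output.
timeline = [0, 1, 2, 3, 4, 5]
S = [1.0, 0.8, 0.6, 0.3, 0.1, 0.0]

# Trapezoidal rule: E[T] ~= sum of 0.5 * (S_i + S_{i+1}) * dt_i.
expectation = sum(0.5 * (S[i] + S[i + 1]) * (timeline[i + 1] - timeline[i])
                  for i in range(len(timeline) - 1))

print(round(expectation, 6))  # 2.3
```

Note the caveat above: if this curve plateaued above zero, the tail area would be unbounded and the truncated sum would understate E[T].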

predict_log_partial_hazard
(X)¶ This is equivalent to R's linear.predictors. Returns the log of the partial hazard for the individuals, partial since the baseline hazard is not included. Equal to \(\beta (X - \text{mean}(X_{train}))\)
Parameters: X (numpy array or DataFrame) â€“ a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data. Returns: log_partial_hazard Return type: DataFrame Notes
If X is a dataframe, the order of the columns do not matter. But if X is an array, then the column ordering is assumed to be the same as the training dataset.

predict_median
(X)Â¶ Predict the median lifetimes for the individuals. If the survival curve of an individual does not cross 0.5, then the result is infinity.
Parameters: X (numpy array or DataFrame) â€“ a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data. Returns: percentiles â€“ the median lifetimes for the individuals. If the survival curve of an individual does not cross 0.5, then the result is infinity. Return type: DataFrame See also

predict_partial_hazard
(X)¶ Parameters: X (numpy array or DataFrame) – a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data. Returns: partial_hazard – Returns the partial hazard for the individuals, partial since the baseline hazard is not included. Equal to \(\exp(\beta (X - \text{mean}(X_{train})))\) Return type: DataFrame Notes
If X is a dataframe, the order of the columns do not matter. But if X is an array, then the column ordering is assumed to be the same as the training dataset.
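The centering-then-exponentiating arithmetic described above can be sketched in a few lines of numpy; the coefficients and training-set means below are hypothetical values, not from a real fit:

```python
import numpy as np

# Hypothetical fitted coefficients and training-set column means.
beta = np.array([0.5, -0.25])
X_train_mean = np.array([1.0, 2.0])

# Two new observations, columns in the same order as the training data.
X = np.array([[1.2, 1.5],
              [0.8, 3.0]])

# log partial hazard: beta . (X - mean(X_train))
log_partial_hazard = (X - X_train_mean) @ beta
# partial hazard: exp of the above (baseline hazard not included)
partial_hazard = np.exp(log_partial_hazard)
```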

predict_percentile
(X, p=0.5)¶ Returns the median lifetimes for the individuals, by default. If the survival curve of an individual does not cross 0.5, then the result is infinity. http://stats.stackexchange.com/questions/102986/percentile-loss-functions
Parameters:  X (numpy array or DataFrame) â€“ a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
 p (float, optional (default=0.5)) â€“ the percentile, must be between 0 and 1.
Returns: percentiles
Return type: DataFrame
See also

predict_survival_function
(X, times=None)Â¶ Predict the survival function for individuals, given their covariates. This assumes that the individual just entered the study (that is, we do not condition on how long they have already lived for.)
Parameters:  X (numpy array or DataFrame) â€“ a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
 times (iterable, optional) â€“ an iterable of increasing times to predict the cumulative hazard at. Default is the set of all durations (observed and unobserved). Uses a linear interpolation if points in time are not in the index.
Returns: survival_function â€“ the survival probabilities of individuals over the timeline
Return type: DataFrame

print_summary
(decimals=2, **kwargs)Â¶ Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) â€“ specify the number of decimal places to show
 alpha (float or iterable) â€“ specify confidence intervals to show
 kwargs â€“ print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

score_
¶ The concordance score (also known as the c-index) of the fit. The c-index is a generalization of the AUC to survival data, including censorships.
For this purpose, the
score_
is a measure of the predictive accuracy of the fitted model on the training dataset. It's analogous to the R^2 in linear models.
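The c-index can be illustrated with a naive O(n^2) pairwise count. This is an illustrative sketch with hypothetical data, not lifelines' (far more efficient) implementation:

```python
import numpy as np

def concordance_index(T, risk, E):
    """Naive c-index: among comparable pairs (the earlier time must be an
    observed event), count the fraction where the subject with the shorter
    survival time also has the higher predicted risk. Ties count 0.5."""
    concordant, comparable = 0.0, 0
    for i in range(len(T)):
        for j in range(len(T)):
            if T[i] < T[j] and E[i]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

# Hypothetical data: risk scores perfectly anti-ordered with survival times.
T = np.array([2.0, 4.0, 6.0])
E = np.array([1, 1, 0])          # the longest-lived subject is censored
risk = np.array([3.0, 2.0, 1.0])
cindex = concordance_index(T, risk, E)
```

A model that perfectly orders subjects scores 1.0; random predictions score about 0.5.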

summary
Â¶ Summary statistics describing the fit. Set alpha property in the object before calling.
Returns: df â€“ Contains columns coef, np.exp(coef), se(coef), z, p, lower, upper Return type: DataFrame

class
lifelines.
WeibullFitter
(*args, **kwargs)Â¶ Bases:
lifelines.fitters.KnownModelParametericUnivariateFitter
This class implements a Weibull model for univariate data. The model has parameterized form:
\[S(t) = \exp(-(\lambda t)^\rho), \lambda > 0, \rho > 0,\]which implies the cumulative hazard rate is
\[H(t) = (\lambda t)^\rho,\]and the hazard rate is:
\[h(t) = \rho \lambda (\lambda t)^{\rho - 1}\]After calling the .fit method, you have access to properties like:
cumulative_hazard_
,survival_function_
,lambda_
andrho_
. A summary of the fit is available with the methodprint_summary()
.Examples
>>> from lifelines import WeibullFitter >>> from lifelines.datasets import load_waltons >>> waltons = load_waltons() >>> wbf = WeibullFitter() >>> wbf.fit(waltons['T'], waltons['E']) >>> wbf.plot() >>> print(wbf.lambda_)
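The three Weibull quantities above are mutually consistent: \(S(t) = \exp(-H(t))\) and \(h(t) = dH/dt\). A short numpy check with hypothetical parameter values:

```python
import numpy as np

lambda_, rho_ = 0.02, 3.0   # hypothetical fitted values
t = np.array([10.0, 50.0, 80.0])

H = (lambda_ * t) ** rho_                           # H(t) = (lambda t)^rho
S = np.exp(-H)                                      # S(t) = exp(-(lambda t)^rho)
h = rho_ * lambda_ * (lambda_ * t) ** (rho_ - 1.0)  # h(t) = rho lambda (lambda t)^(rho-1)
```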

conditional_time_to_event_
Â¶ Return a DataFrame, with index equal to
survival_function_
, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual exists until age 1, their expected life remaining given they lived to time 1 might be 9 years.Returns: conditional_time_to_ â€“ with index equal to survival_function_
Return type: DataFrame

confidence_interval_
Â¶ The confidence interval of the cumulative hazard. This is an alias for
confidence_interval_cumulative_hazard_
.

confidence_interval_cumulative_hazard_
Â¶ The confidence interval of the cumulative hazard. This is an alias for
confidence_interval_
.

confidence_interval_hazard_
Â¶ The confidence interval of the hazard.

confidence_interval_survival_function_
Â¶ The confidence interval of the survival function.

cumulative_hazard_at_times
(times, label=None)Â¶ Return a Pandas series of the predicted cumulative hazard value at specific times.
Parameters:  times (iterable or float) â€“ values to return the cumulative hazard at.
 label (string, optional) â€“ Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

divide
(other)¶ Divide the fitted estimate (e.g. the cumulative hazard) of two fitted objects of the same class.
Parameters: other (a fitted instance of the same class.)

fit
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None)¶ Parameters:  durations (an array, or pd.Series) – length n, duration the subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study at time zero (when they were "born").
Returns: self â€“ self with new properties like
cumulative_hazard_
,survival_function_
Return type:

hazard_at_times
(times, label=None)Â¶ Return a Pandas series of the predicted hazard at specific times.
Parameters:  times (iterable or float) â€“ values to return the hazard at.
 label (string, optional) â€“ Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

median_
Â¶ Return the unique time point, t, such that S(t) = 0.5. This is the â€śhalflifeâ€ť of the population, and a robust summary statistic for the population, if it exists.

plot
(**kwargs)Â¶

plot_cumulative_hazard
(**kwargs)Â¶

plot_hazard
(**kwargs)Â¶

plot_survival_function
(**kwargs)Â¶

predict
(times)¶ Predict the fitted estimate (e.g. the cumulative hazard) at certain points in time. Uses a linear interpolation if the given times are not in the index.
Parameters: times (a scalar or an array of times to predict the value at.) Returns: predictions Return type: a scalar if times is a scalar, a numpy array if times is an array.
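The linear interpolation between timeline points that `predict` is described as using can be sketched with `numpy.interp`; the timeline and values below are hypothetical:

```python
import numpy as np

timeline = np.array([0.0, 5.0, 10.0])          # fitted index (hypothetical)
cumulative_hazard = np.array([0.0, 0.2, 0.6])  # hypothetical estimates

# 7.5 is not on the timeline, so interpolate between (5.0, 0.2) and (10.0, 0.6)
value = np.interp(7.5, timeline, cumulative_hazard)
```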

print_summary
(decimals=2, **kwargs)Â¶ Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) â€“ specify the number of decimal places to show
 kwargs â€“ print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

subtract
(other)¶ Subtract the fitted estimate of two fitted objects of the same class.
Parameters: other (a fitted instance of the same class.)

summary
Â¶ Summary statistics describing the fit.
Returns: df â€“ Contains columns coef, exp(coef), se(coef), z, p, lower, upper Return type: pd.DataFrame See also
print_summary

survival_function_at_times
(times, label=None)Â¶ Return a Pandas series of the predicted survival value at specific times.
Parameters:  times (iterable or float) â€“ values to return the survival function at.
 label (string, optional) â€“ Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series


class
lifelines.
ExponentialFitter
(*args, **kwargs)Â¶ Bases:
lifelines.fitters.KnownModelParametericUnivariateFitter
This class implements an Exponential model for univariate data. The model has parameterized form:
\[S(t) = \exp(-\lambda t), \lambda > 0\]which implies the cumulative hazard rate is
\[H(t) = \lambda t\]and the hazard rate is:
\[h(t) = \lambda\]After calling the .fit method, you have access to properties like:
survival_function_
,lambda_
,cumulative_hazard_
A summary of the fit is available with the methodprint_summary()
Notes
Reference: https://www4.stat.ncsu.edu/~dzhang2/st745/chap3.pdf
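With a constant hazard, the median follows in closed form from \(S(t) = 0.5\). A small numpy sketch with a hypothetical rate parameter:

```python
import numpy as np

lam = 0.05                     # hypothetical rate parameter
t = np.array([5.0, 20.0, 60.0])

H = lam * t                    # H(t) = lambda * t
S = np.exp(-H)                 # S(t) = exp(-lambda * t)

# Median: solve exp(-lam * t) = 0.5  ->  t = ln(2) / lam
median = np.log(2) / lam
```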

conditional_time_to_event_
Â¶ Return a DataFrame, with index equal to
survival_function_
, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual exists until age 1, their expected life remaining given they lived to time 1 might be 9 years.Returns: conditional_time_to_ â€“ with index equal to survival_function_
Return type: DataFrame

confidence_interval_
Â¶ The confidence interval of the cumulative hazard. This is an alias for
confidence_interval_cumulative_hazard_
.

confidence_interval_cumulative_hazard_
Â¶ The confidence interval of the cumulative hazard. This is an alias for
confidence_interval_
.

confidence_interval_hazard_
Â¶ The confidence interval of the hazard.

confidence_interval_survival_function_
Â¶ The confidence interval of the survival function.

cumulative_hazard_at_times
(times, label=None)Â¶ Return a Pandas series of the predicted cumulative hazard value at specific times.
Parameters:  times (iterable or float) â€“ values to return the cumulative hazard at.
 label (string, optional) â€“ Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

divide
(other)¶ Divide the fitted estimate (e.g. the cumulative hazard) of two fitted objects of the same class.
Parameters: other (a fitted instance of the same class.)

fit
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None)¶ Parameters:  durations (an array, or pd.Series) – length n, duration the subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study at time zero (when they were "born").
Returns: self â€“ self with new properties like
cumulative_hazard_
,survival_function_
Return type:

hazard_at_times
(times, label=None)Â¶ Return a Pandas series of the predicted hazard at specific times.
Parameters:  times (iterable or float) â€“ values to return the hazard at.
 label (string, optional) â€“ Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

median_
Â¶ Return the unique time point, t, such that S(t) = 0.5. This is the â€śhalflifeâ€ť of the population, and a robust summary statistic for the population, if it exists.

plot
(**kwargs)Â¶

plot_cumulative_hazard
(**kwargs)Â¶

plot_hazard
(**kwargs)Â¶

plot_survival_function
(**kwargs)Â¶

predict
(times)¶ Predict the fitted estimate (e.g. the cumulative hazard) at certain points in time. Uses a linear interpolation if the given times are not in the index.
Parameters: times (a scalar or an array of times to predict the value at.) Returns: predictions Return type: a scalar if times is a scalar, a numpy array if times is an array.

print_summary
(decimals=2, **kwargs)Â¶ Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) â€“ specify the number of decimal places to show
 kwargs â€“ print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

subtract
(other)¶ Subtract the fitted estimate of two fitted objects of the same class.
Parameters: other (a fitted instance of the same class.)

summary
Â¶ Summary statistics describing the fit.
Returns: df â€“ Contains columns coef, exp(coef), se(coef), z, p, lower, upper Return type: pd.DataFrame See also
print_summary

survival_function_at_times
(times, label=None)Â¶ Return a Pandas series of the predicted survival value at specific times.
Parameters:  times (iterable or float) â€“ values to return the survival function at.
 label (string, optional) â€“ Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series


class
lifelines.
CoxTimeVaryingFitter
(alpha=0.95, penalizer=0.0, strata=None)Â¶ Bases:
lifelines.fitters.BaseFitter
This class implements fitting Cox's time-varying proportional hazard model:
\[h(t|x(t)) = h_0(t)\exp(x(t)' \beta)\]Parameters:  alpha (float, optional) – the level in the confidence intervals.
 penalizer (float, optional) – the coefficient of an l2 penalizer in the regression

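`CoxTimeVaryingFitter` expects data in a long format: one row per subject-period, with columns bounding each interval. A hypothetical DataFrame of the expected shape (the covariate name `z` and the `id`/`event` column names are illustrative; only `start`/`stop` are the documented defaults):

```python
import pandas as pd

# One row per subject-period; `z` is a hypothetical time-varying covariate.
df = pd.DataFrame({
    "id":    [1, 1, 2, 2],
    "start": [0, 3, 0, 5],
    "stop":  [3, 7, 5, 9],
    "z":     [0.1, 0.4, 0.2, 0.3],
    "event": [0, 1, 0, 0],   # subject 1 dies at t=7; subject 2 is censored
})
```

A fit call over such a frame would then pass the column names explicitly, e.g. `ctv.fit(df, id_col="id", event_col="event", start_col="start", stop_col="stop")`.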
fit
(df, id_col, event_col, start_col='start', stop_col='stop', weights_col=None, show_progress=False, step_size=None, robust=False, strata=None)¶ Fit the Cox proportional hazard model to a time-varying dataset. Tied survival times are handled using Efron's tie-breaking method.
Parameters:  df (DataFrame) – a Pandas DataFrame with necessary columns duration_col and event_col, plus other covariates. duration_col refers to the lifetimes of the subjects. event_col refers to whether the 'death' event was observed: 1 if observed, 0 else (censored).
 id_col (string) – A subject could have multiple rows in the DataFrame. This column contains the unique identifier per subject.
 event_col (string) – the column in the DataFrame that contains the subjects' death observation. If left as None, assume all individuals are non-censored.
 start_col (string) – the column that contains the start of a subject's time period.
 stop_col (string) – the column that contains the end of a subject's time period.
 weights_col (string, optional) – the column that contains the (possibly time-varying) weight of each subject-period row.
 show_progress (boolean, optional) – since the fitter is iterative, show convergence diagnostics.
 robust (boolean, optional (default: False)) – Compute the robust errors using the Huber sandwich estimator, aka the Wei-Lin estimate. This does not handle ties, so if there is a high number of ties, results may differ significantly. See "The Robust Inference for the Cox Proportional Hazards Model", Journal of the American Statistical Association, Vol. 84, No. 408 (Dec., 1989), pp. 1074-1078
 step_size (float, optional) – set an initial step size for the fitting algorithm.
 strata (TODO)
Returns: self â€“ self, with additional properties like
hazards_
andprint_summary
Return type:

plot
(columns=None, **errorbar_kwargs)Â¶ Produces a visual representation of the coefficients, including their standard errors and magnitudes.
Parameters:  columns (list, optional) â€“ specifiy a subset of the columns to plot
 errorbar_kwargs â€“ pass in additional plotting commands to matplotlib errorbar command
Returns: ax â€“ the matplotlib axis that be edited.
Return type: matplotlib axis

predict_log_partial_hazard
(X)¶ This is equivalent to R's linear.predictors. Returns the log of the partial hazard for the individuals, partial since the baseline hazard is not included. Equal to \(\beta (X - \bar{X})\)
Parameters: X (numpy array or DataFrame) – a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data. Returns: Return type: DataFrame Note
If X is a DataFrame, the order of the columns does not matter. But if X is an array, the column ordering is assumed to match the training dataset.

predict_partial_hazard
(X)¶ Returns the partial hazard for the individuals, partial since the baseline hazard is not included. Equal to \(\exp{\beta (X - \bar{X})}\)
Parameters: X (numpy array or DataFrame) – a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data. Returns: Return type: DataFrame Note
If X is a DataFrame, the order of the columns does not matter. But if X is an array, the column ordering is assumed to match the training dataset.

print_summary
(decimals=2, **kwargs)Â¶ Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) â€“ specify the number of decimal places to show
 kwargs â€“ print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

summary
Â¶ Summary statistics describing the fit. Set alpha property in the object before calling.
Returns: df â€“ contains columns coef, exp(coef), se(coef), z, p, lower, upper Return type: DataFrame

class
lifelines.
AalenJohansenFitter
(jitter_level=0.0001, seed=None, alpha=0.95, calculate_variance=True)Â¶ Bases:
lifelines.fitters.UnivariateFitter
Class for fitting the Aalen-Johansen estimate of the cumulative incidence function in a competing risks framework. Treating competing risks as censoring can result in overestimated cumulative density functions. Using the Kaplan-Meier estimator with competing risks treated as censoring is akin to estimating the cumulative density if all competing risks had been prevented.
Aalen-Johansen cannot deal with tied times. We can get around this by randomly jittering the event times slightly. This is done automatically and generates a warning.
Parameters:  alpha (float, optional (default=0.95)) – The alpha value associated with the confidence intervals.
 jitter_level (float, optional (default=0.00001)) – If tied event times are detected, event times are randomly changed by this factor.
 seed (int, optional (default=None)) – To produce replicable results with tied event times, numpy.random.seed can be specified in the function.
 calculate_variance (bool, optional (default=True)) – By default, AalenJohansenFitter calculates the variance and corresponding confidence intervals. Due to how the variance is calculated, it must be computed for each event time individually, which is computationally intensive. For some procedures, like bootstrapping, the variance is not necessary. To reduce computation time during these procedures, calculate_variance can be set to False to skip the variance calculation.
Example
>>> AalenJohansenFitter(alpha=0.95, jitter_level=0.00001, seed=None, calculate_variance=True)
References
If you are interested in learning more, we recommend the following openaccess paper; Edwards JK, Hester LL, Gokhale M, Lesko CR. Methodologic Issues When Estimating Risks in Pharmacoepidemiology. Curr Epidemiol Rep. 2016;3(4):285296.
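The automatic tie-breaking described above can be sketched in a few lines of numpy; the jitter magnitude and tied times are hypothetical, and lifelines' actual jittering details may differ:

```python
import numpy as np

rng = np.random.default_rng(42)
times = np.array([3.0, 3.0, 5.0, 7.0, 7.0])   # two sets of tied event times

jitter_level = 1e-4
has_ties = len(np.unique(times)) < len(times)
if has_ties:
    # Perturb each time by a small uniform amount to break the ties.
    times = times + rng.uniform(-jitter_level, jitter_level, size=times.shape)
```

Seeding the generator, as the `seed` parameter does, makes the jittered results reproducible.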

conditional_time_to_event_
Â¶ Return a DataFrame, with index equal to
survival_function_
, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual exists until age 1, their expected life remaining given they lived to time 1 might be 9 years.Returns: conditional_time_to_ â€“ with index equal to survival_function_
Return type: DataFrame

cumulative_hazard_at_times
(times, label=None)Â¶

divide
(other)¶ Divide the fitted estimate (e.g. the cumulative hazard) of two fitted objects of the same class.
Parameters: other (a fitted instance of the same class.)

fit
(durations, event_observed, event_of_interest, timeline=None, entry=None, label='AJ_estimate', alpha=None, ci_labels=None, weights=None)¶ Parameters:  durations (an array or pd.Series of length n) – duration the subject was observed for
 event_observed (an array, or pd.Series, of length n) – integer indicator of distinct events. Must be only positive integers, where 0 indicates censoring.
 event_of_interest (integer) – indicator for the event of interest. All other integers are considered competing events. Ex: event_observed contains 0, 1, 2 where 0: censored, 1: lung cancer, and 2: death. If event_of_interest=1, then death (2) is considered a competing event. The returned cumulative incidence function corresponds to the risk of lung cancer.
 timeline (list, optional) – return the best estimate at the values in timeline (positively increasing)
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population were born at time 0.
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 weights (an array, or pd.Series, of length n) – if providing a weighted dataset. For example, instead of providing every subject as a single element of durations and event_observed, one could weigh subjects differently.
Returns: self â€“ self, with new properties like
cumulative_incidence_
.Return type:

hazard_at_times
(times, label=None)Â¶

plot
(**kwargs)Â¶

plot_cumulative_hazard
(**kwargs)Â¶

plot_hazard
(**kwargs)Â¶

plot_survival_function
(**kwargs)Â¶

predict
(times)¶ Predict the fitted estimate (e.g. the cumulative incidence) at certain points in time. Uses a linear interpolation if the given times are not in the index.
Parameters: times (a scalar or an array of times to predict the value at.) Returns: predictions Return type: a scalar if times is a scalar, a numpy array if times is an array.

subtract
(other)¶ Subtract the fitted estimate of two fitted objects of the same class.
Parameters: other (a fitted instance of the same class.)

survival_function_at_times
(times, label=None)Â¶

class
lifelines.
LogNormalFitter
(*args, **kwargs)Â¶ Bases:
lifelines.fitters.KnownModelParametericUnivariateFitter
This class implements a Log-Normal model for univariate data. The model has parameterized form:
\[S(t) = 1 - \Phi((\log(t) - \mu)/\sigma), \sigma > 0\]where \(\Phi\) is the CDF of a standard normal random variable. This implies the cumulative hazard rate is
\[H(t) = -\log(1 - \Phi((\log(t) - \mu)/\sigma))\]After calling the .fit method, you have access to properties like:
survival_function_
,mu_
,sigma_
. A summary of the fit is available with the methodprint_summary()
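Writing \(\Phi\) via the error function, the Log-Normal survival function above can be evaluated directly. Note that at \(t = e^{\mu}\) the survival is exactly 0.5, so \(e^{\mu}\) is the median. A sketch with hypothetical parameter values:

```python
import numpy as np
from math import erf, sqrt, exp, log

def std_normal_cdf(x):
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

mu, sigma = 2.0, 0.75          # hypothetical fitted values

def survival(t):
    # S(t) = 1 - Phi((log t - mu) / sigma)
    return 1.0 - std_normal_cdf((log(t) - mu) / sigma)

def cumulative_hazard(t):
    # H(t) = -log S(t)
    return -np.log(survival(t))

median = exp(mu)               # S(exp(mu)) = 1 - Phi(0) = 0.5
```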

conditional_time_to_event_
Â¶ Return a DataFrame, with index equal to
survival_function_
, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual exists until age 1, their expected life remaining given they lived to time 1 might be 9 years.Returns: conditional_time_to_ â€“ with index equal to survival_function_
Return type: DataFrame

confidence_interval_
Â¶ The confidence interval of the cumulative hazard. This is an alias for
confidence_interval_cumulative_hazard_
.

confidence_interval_cumulative_hazard_
Â¶ The confidence interval of the cumulative hazard. This is an alias for
confidence_interval_
.

confidence_interval_hazard_
Â¶ The confidence interval of the hazard.

confidence_interval_survival_function_
Â¶ The confidence interval of the survival function.

cumulative_hazard_at_times
(times, label=None)Â¶ Return a Pandas series of the predicted cumulative hazard value at specific times.
Parameters:  times (iterable or float) â€“ values to return the cumulative hazard at.
 label (string, optional) â€“ Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

divide
(other)¶ Divide the fitted estimate (e.g. the cumulative hazard) of two fitted objects of the same class.
Parameters: other (a fitted instance of the same class.)

fit
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None)¶ Parameters:  durations (an array, or pd.Series) – length n, duration the subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study at time zero (when they were "born").
Returns: self â€“ self with new properties like
cumulative_hazard_
,survival_function_
Return type:

hazard_at_times
(times, label=None)Â¶ Return a Pandas series of the predicted hazard at specific times.
Parameters:  times (iterable or float) â€“ values to return the hazard at.
 label (string, optional) â€“ Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

median_
Â¶ Return the unique time point, t, such that S(t) = 0.5. This is the â€śhalflifeâ€ť of the population, and a robust summary statistic for the population, if it exists.

plot
(**kwargs)Â¶

plot_cumulative_hazard
(**kwargs)Â¶

plot_hazard
(**kwargs)Â¶

plot_survival_function
(**kwargs)Â¶

predict
(times)¶ Predict the fitted estimate (e.g. the cumulative hazard) at certain points in time. Uses a linear interpolation if the given times are not in the index.
Parameters: times (a scalar or an array of times to predict the value at.) Returns: predictions Return type: a scalar if times is a scalar, a numpy array if times is an array.

print_summary
(decimals=2, **kwargs)Â¶ Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) â€“ specify the number of decimal places to show
 kwargs â€“ print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

subtract
(other)¶ Subtract the fitted estimate of two fitted objects of the same class.
Parameters: other (a fitted instance of the same class.)

summary
Â¶ Summary statistics describing the fit.
Returns: df â€“ Contains columns coef, exp(coef), se(coef), z, p, lower, upper Return type: pd.DataFrame See also
print_summary

survival_function_at_times
(times, label=None)Â¶ Return a Pandas series of the predicted survival value at specific times.
Parameters:  times (iterable or float) â€“ values to return the survival function at.
 label (string, optional) â€“ Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series


class
lifelines.
LogLogisticFitter
(*args, **kwargs)Â¶ Bases:
lifelines.fitters.KnownModelParametericUnivariateFitter
This class implements a Log-Logistic model for univariate data. The model has parameterized form:
\[S(t) = \left(1 + \left(\frac{t}{\alpha}\right)^{\beta}\right)^{-1}, \alpha > 0, \beta > 0,\]and the hazard rate is:
\[h(t) = \frac{\left(\frac{\beta}{\alpha}\right)\left(\frac{t}{\alpha}\right)^{\beta - 1}}{1 + \left(\frac{t}{\alpha}\right)^{\beta}}\]and the cumulative hazard is:
\[H(t) = \log\left(\left(\frac{t}{\alpha}\right)^{\beta} + 1\right)\]After calling the .fit method, you have access to properties like:
cumulative_hazard_
,plot
,survival_function_
,alpha_
andbeta_
. A summary of the fit is available with the method print_summary().Examples
>>> from lifelines import LogLogisticFitter >>> from lifelines.datasets import load_waltons >>> waltons = load_waltons() >>> llf = LogLogisticFitter() >>> llf.fit(waltons['T'], waltons['E']) >>> llf.plot() >>> print(llf.alpha_)
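The identities \(S(t) = \exp(-H(t))\) and \(S(\alpha) = 0.5\) (so \(\alpha\) is the median) can be checked numerically with hypothetical parameters:

```python
import numpy as np

alpha, beta = 10.0, 2.5        # hypothetical fitted values
t = np.array([4.0, 10.0, 25.0])

S = (1.0 + (t / alpha) ** beta) ** -1.0       # survival function
H = np.log((t / alpha) ** beta + 1.0)         # cumulative hazard
h = (beta / alpha) * (t / alpha) ** (beta - 1.0) / (1.0 + (t / alpha) ** beta)
```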

conditional_time_to_event_
Â¶ Return a DataFrame, with index equal to
survival_function_
, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual exists until age 1, their expected life remaining given they lived to time 1 might be 9 years.Returns: conditional_time_to_ â€“ with index equal to survival_function_
Return type: DataFrame

confidence_interval_
Â¶ The confidence interval of the cumulative hazard. This is an alias for
confidence_interval_cumulative_hazard_
.

confidence_interval_cumulative_hazard_
Â¶ The confidence interval of the cumulative hazard. This is an alias for
confidence_interval_
.

confidence_interval_hazard_
Â¶ The confidence interval of the hazard.

confidence_interval_survival_function_
Â¶ The confidence interval of the survival function.

cumulative_hazard_at_times
(times, label=None)Â¶ Return a Pandas series of the predicted cumulative hazard value at specific times.
Parameters:  times (iterable or float) â€“ values to return the cumulative hazard at.
 label (string, optional) â€“ Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

divide
(other)¶ Divide the fitted estimate (e.g. the cumulative hazard) of two fitted objects of the same class.
Parameters: other (a fitted instance of the same class.)

fit
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None)¶ Parameters:  durations (an array, or pd.Series) – length n, duration the subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study at time zero (when they were "born").
Returns: self â€“ self with new properties like
cumulative_hazard_
,survival_function_
Return type:

hazard_at_times
(times, label=None)Â¶ Return a Pandas series of the predicted hazard at specific times.
Parameters:  times (iterable or float) â€“ values to return the hazard at.
 label (string, optional) â€“ Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

median_
Â¶ Return the unique time point, t, such that S(t) = 0.5. This is the â€śhalflifeâ€ť of the population, and a robust summary statistic for the population, if it exists.

plot
(**kwargs)Â¶

plot_cumulative_hazard
(**kwargs)Â¶

plot_hazard
(**kwargs)Â¶

plot_survival_function
(**kwargs)Â¶

predict
(times)¶ Predict the fitted estimate (e.g. the cumulative hazard) at certain points in time. Uses a linear interpolation if the given times are not in the index.
Parameters: times (a scalar or an array of times to predict the value at.) Returns: predictions Return type: a scalar if times is a scalar, a numpy array if times is an array.

print_summary(decimals=2, **kwargs)¶
Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) – specify the number of decimal places to show
 kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

subtract(other)¶
Subtract the estimates of two fitted fitter objects.
Parameters: other (a fitted instance of the same fitter class.)

summary¶
Summary statistics describing the fit.
Returns: df – Contains columns coef, exp(coef), se(coef), z, p, lower, upper
Return type: pd.DataFrame
See also print_summary

survival_function_at_times(times, label=None)¶
Return a Pandas series of the predicted survival value at specific times.
Parameters:  times (iterable or float) – values to return the survival function at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Return type: pd.Series

Changelog¶
0.18.6¶
- some improvements to the output of check_assumptions. show_plots is turned to False by default now. It only shows rank and km p-values now.
- some performance improvements to qth_survival_time.
0.18.5¶
- added new plotting methods to parametric univariate models: plot_survival_function, plot_hazard and plot_cumulative_hazard. The last one is an alias for plot.
- added new properties to parametric univariate models: confidence_interval_survival_function_, confidence_interval_hazard_, confidence_interval_cumulative_hazard_. The last one is an alias for confidence_interval_.
- Fixed some overflow issues with AalenJohansenFitter's variance calculations when using large datasets.
- Fixed an edge case in AalenJohansenFitter that was causing some datasets to be jittered too often.
- Added a new kwarg to AalenJohansenFitter, calculate_variance, that can be used to turn off variance calculations since this can take a long time for large datasets. Thanks @pzivich!
0.18.4¶
- fixed confidence intervals in cumulative hazards for parametric univariate models. They were previously severely depressed.
- adding left-truncation support to parametric univariate models with the entry kwarg in .fit
0.18.3¶
- Some performance improvements to parametric univariate models.
- Suppressing some irrelevant NumPy and autograd warnings, so lifelines warnings are more noticeable.
- Improved some warning and error messages.
0.18.2¶
- New univariate fitter PiecewiseExponentialFitter for creating a stepwise hazard model. See docs online.
- Ability to create novel parametric univariate models using the new ParametricUnivariateFitter superclass. See docs online for how to do this.
- Unfortunately, parametric univariate fitters are not serializable with pickle. The library dill is still usable.
- Complete overhaul of all internals for parametric univariate fitters. Moved them all (most) to use autograd.
- LogNormalFitter no longer models log_sigma.
0.18.1¶
- bug fixes in LogNormalFitter variance estimates
- improve convergence of LogNormalFitter. We now model the log of sigma internally, but still expose sigma externally.
- use the autograd lib to help with gradients.
- New LogLogisticFitter univariate fitter available.
0.18.0¶
- LogNormalFitter is a new univariate fitter you can use.
- WeibullFitter now correctly returns the confidence intervals (previously returned only NaNs)
- WeibullFitter.print_summary() displays p-values associated with its parameters not equal to 1.0 – previously this was (implicitly) comparing against 0, which is trivially always true (the parameters must be greater than 0)
- ExponentialFitter.print_summary() displays p-values associated with its parameters not equal to 1.0 – previously this was (implicitly) comparing against 0, which is trivially always true (the parameters must be greater than 0)
- ExponentialFitter.plot now displays the cumulative hazard, instead of the survival function. This is to make it easier to compare to WeibullFitter and LogNormalFitter
- Univariate fitters' cumulative_hazard_at_times, hazard_at_times, survival_function_at_times return pandas Series now (used to be numpy arrays)
- remove alpha keyword from all statistical functions. This was never being used.
- Gone are asterisks and dots in print_summary functions that represent significance thresholds.
- In models' summary (including print_summary), the log(p) term has changed to -log2(p). This is known as the s-value. See https://lesslikely.com/statistics/svalues/
- introduce new statistical tests between univariate datasets: survival_difference_at_fixed_point_in_time_test, …
- new warning message when Cox models detect possible non-unique solutions to maximum likelihood.
- Generally: clean up lifelines exception handling. Ex: catch LinAlgError: Matrix is singular. and report back to the user advice.
0.17.5¶
- more bugs in plot_covariate_groups fixed when using non-numeric strata.
0.17.4¶
- Fix bug in plot_covariate_groups that wasn't allowing for strata to be used.
- change name of multicenter_aids_cohort_study to load_multicenter_aids_cohort_study
- groups is now called values in CoxPHFitter.plot_covariate_groups
0.17.3¶
- Fix in compute_residuals when using schoenfeld and the minimum duration has only censored subjects.
0.17.2¶
- Another round of serious performance improvements for the Cox models. Up to 2x faster for CoxPHFitter and CoxTimeVaryingFitter. This was mostly the result of using NumPy's einsum to simplify a previous for loop. The downside is the code is more esoteric now. I've added comments as necessary though 🤞
0.17.1¶
- adding bottleneck as a dependency. This library is highly recommended by Pandas, and in lifelines we see some nice performance improvements with it too. (~15% for CoxPHFitter)
- There was a small bug in CoxPHFitter when using batch_mode that was causing coefficients to deviate from their MLE value. This bug eluded tests, which means that its discrepancy was less than 0.0001 difference. It's fixed now, and even more accurate tests are added.
- Faster CoxPHFitter._compute_likelihood_ratio_test()
- Fixes a Pandas performance warning in CoxTimeVaryingFitter.
- Performance improvements to CoxTimeVaryingFitter.
0.17.0¶
- corrected behaviour in CoxPHFitter where score_ was not being refreshed on every new fit.
- Reimplementation of AalenAdditiveFitter. There were significant changes to it:
  - implementation is at least 10x faster, and possibly up to 100x faster for some datasets.
  - memory consumption is way down
  - removed the time-varying component from AalenAdditiveFitter. This will return in a future release.
  - new print_summary
  - weights_col is added
  - nn_cumulative_hazard is removed (may add back)
- some plotting improvements to plotting.plot_lifetimes
0.16.3¶
- More CoxPHFitter performance improvements. Up to a 40% reduction vs 0.16.2 for some datasets.
0.16.2¶
- Fixed CoxTimeVaryingFitter to allow more than one variable to be stratified
- Significant performance improvements for CoxPHFitter when the dataset has lots of duplicate times. See https://github.com/CamDavidsonPilon/lifelines/issues/591
0.16.1¶
- Fixed py2 division error in concordance method.
0.16.0¶
- Drop Python 3.4 support.
- introduction of residual calculations in CoxPHFitter.compute_residuals. Residuals include "schoenfeld", "score", "delta_beta", "deviance", "martingale", and "scaled_schoenfeld".
- removes estimation namespace for fitters. Should be using from lifelines import xFitter now. Thanks @usmanatron
- removes predict_log_hazard_relative_to_mean from Cox model. Thanks @usmanatron
- StatisticalResult has been generalized to allow for multiple results (ex: from pairwise comparisons). This means a slightly changed API that is mostly backwards compatible. See doc string for how to use it.
- statistics.pairwise_logrank_test now returns a StatisticalResult object instead of a nasty NxN DataFrame 💗
- Display log(p-values) as well as p-values in print_summary. Also, p-values below thresholds will be truncated. The original p-values are still recoverable using .summary.
- Floats in print_summary are now displayed to 2 decimal points. This can be changed using the decimals kwarg.
- removed standardized from Cox model plotting. It was confusing.
- visual improvements to Cox models' .plot
- print_summary methods accept kwargs to also be displayed.
- CoxPHFitter has a new human-readable method, check_assumptions, to check the assumptions of your Cox proportional hazard model.
- A new helper util to "expand" static datasets into long form: lifelines.utils.to_episodic_format.
- CoxTimeVaryingFitter now accepts strata.
0.15.4¶
- bug fix for the Cox model likelihood ratio test when using non-trivial weights.
0.15.3Â¶
 Only allow matplotlib less than 3.0.
0.15.2¶
- API changes to plotting.plot_lifetimes
- cluster_col and strata can be used together in CoxPHFitter
- removed entry from ExponentialFitter and WeibullFitter as it was doing nothing.
0.15.1¶
- Bug fixes for v0.15.0
- Raise NotImplementedError if the robust flag is used in CoxTimeVaryingFitter – that's not ready yet.
0.15.0¶
- adding robust params to CoxPHFitter's fit. This enables at least i) using non-integer weights in the model (these could be sampling weights like IPTW), and ii) misspecified models (ex: non-proportional hazards). Under the hood it's a sandwich estimator. This does not handle ties, so if there are a high number of ties, results may significantly differ from other software.
- standard_errors_ is now a property on fitted CoxPHFitter which describes the standard errors of the coefficients.
- variance_matrix_ is now a property on fitted CoxPHFitter which describes the variance matrix of the coefficients.
- new criteria for convergence of CoxPHFitter and CoxTimeVaryingFitter called the Newton-decrement. Tests show it is as accurate (w.r.t. previous coefficients) and typically shaves off a single step, resulting in generally faster convergence. See https://www.cs.cmu.edu/~pradeepr/convexopt/Lecture_Slides/Newton_methods.pdf. Details about the Newton-decrement are added to the show_progress statements.
- Minimum support for scipy is 1.0
- Convergence errors in models that use Newton-Raphson methods now throw a ConvergenceError, instead of a ValueError (the former is a subclass of the latter, however).
- AalenAdditiveModel raises ConvergenceWarning instead of printing a warning.
- KaplanMeierFitter now has a cumulative plot option. Example: kmf.plot(invert_y_axis=True)
- a weights_col option has been added to CoxTimeVaryingFitter that allows for time-varying weights.
- WeibullFitter has a new show_progress param and additional information if the convergence fails.
- CoxPHFitter, ExponentialFitter, WeibullFitter and CoxTimeVaryingFitter method print_summary is updated with new fields.
- WeibullFitter has renamed the incorrect _jacobian to _hessian_.
- variance_matrix_ is now a property on fitted WeibullFitter which describes the variance matrix of the parameters.
- The default WeibullFitter().timeline has changed from integers between the min and max duration to n floats between the min and max durations, where n is the number of observations.
- Performance improvements for CoxPHFitter (~20% faster)
- Performance improvements for CoxTimeVaryingFitter (~100% faster)
- In Python3, Univariate models are now serialisable with pickle. Thanks @dwilson1988 for the contribution. For Python2, dill is still the preferred method.
- baseline_cumulative_hazard_ (and derivatives of that) on CoxPHFitter now correctly incorporate the weights_col.
- Fixed a bug in KaplanMeierFitter when late entry times lined up with death events. Thanks @pzivich
- Adding cluster_col argument to CoxPHFitter so users can specify groups of subjects/rows that may be correlated.
- Shifting the "significance codes" for p-values down an order of magnitude. (Example: p-values between 0.1 and 0.05 are not noted at all, and p-values between 0.05 and 0.1 are noted with ".", etc.). This deviates from how they are presented in other software. There is an argument to be made to remove p-values from lifelines altogether (become the changes you want to see in the world lol), but I worry that people could compute the p-values by hand incorrectly, a worse outcome I think. So, this is my stance. P-values between 0.1 and 0.05 offer very little information, so they are removed. There is a growing movement in statistics to shift "significant" findings to p-values less than 0.01 anyways.
- New fitter for cumulative incidence of multiple risks AalenJohansenFitter. Thanks @pzivich! See "Methodologic Issues When Estimating Risks in Pharmacoepidemiology" for a nice overview of the model.
0.14.6¶
- fix for n > 2 groups in multivariate_logrank_test (again).
- fix bug for when event_observed column was not boolean.
0.14.5¶
- fix for n > 2 groups in multivariate_logrank_test
- fix weights in KaplanMeierFitter when using a pandas Series.
0.14.4¶
- Adds baseline_cumulative_hazard_ and baseline_survival_ to CoxTimeVaryingFitter. Because of this, new prediction methods are available.
- fixed a bug in add_covariate_to_timeline when using cumulative_sum with multiple columns.
- Added Likelihood ratio test to CoxPHFitter.print_summary and CoxTimeVaryingFitter.print_summary
- New checks in CoxTimeVaryingFitter that check for immediate deaths and redundant rows.
- New delay parameter in add_covariate_to_timeline
- removed two_sided_z_test from statistics
0.14.3¶
- fixes a bug when subtracting or dividing two UnivariateFitters with labels.
- fixes an import error with using CoxTimeVaryingFitter predict methods.
- adds a column argument to the CoxTimeVaryingFitter and CoxPHFitter plot method to plot only a subset of columns.
0.14.2¶
- some quality of life improvements for working with CoxTimeVaryingFitter including new predict_ methods.
0.14.1¶
- fixed bug with using weights and strata in CoxPHFitter
- fixed bug in using non-integer weights in KaplanMeierFitter
- Performance optimizations in CoxPHFitter for up to 40% faster completion of fit.
  - even smarter step_size calculations for iterative optimizations.
  - simple code optimizations & cleanup in specific hot spots.
- Performance optimizations in AalenAdditiveFitter for up to 50% faster completion of fit for large dataframes, and up to 10% faster for small dataframes.
0.14.0¶
- adding plot_covariate_groups to CoxPHFitter to visualize what happens to survival as we vary a covariate, all else being equal.
- utils functions like qth_survival_times and median_survival_times now return the transpose of the DataFrame compared to previous versions of lifelines. The reason for this is that we often treat survival curves as columns in DataFrames, and functions of the survival curve as index (ex: KaplanMeierFitter.survival_function_ returns a survival curve at time t).
- KaplanMeierFitter.fit and NelsonAalenFitter.fit accept a weights vector that can be used for pre-aggregated datasets. See this issue.
- Convergence errors now return a custom ConvergenceWarning instead of a RuntimeWarning
- New checks for complete separation in the dataset for regressions.
0.13.0¶
- removes is_significant and test_result from StatisticalResult. Users can instead choose their significance level by comparing to p_value. The string representation of this class has changed as well.
- CoxPHFitter and AalenAdditiveFitter now have a score_ property that is the concordance-index of the dataset to the fitted model.
- CoxPHFitter and AalenAdditiveFitter no longer have the data property. It was an almost duplicate of the training data, but was causing the model to be very large when serialized.
- Implements a new fitter CoxTimeVaryingFitter available under the lifelines namespace. This model implements the Cox model for time-varying covariates.
- Utils for creating time varying datasets available in utils.
- less noisy check for complete separation.
- removed datasets namespace from the main lifelines namespace
- CoxPHFitter has a slightly more intelligent (barely…) way to pick a step size, so convergence should generally be faster.
- CoxPHFitter.fit now accepts a weight_col kwarg so one can pass in weights per observation. This is very useful if you have many subjects, and the space of covariates is not large. Thus you can group the same subjects together and give that observation a weight equal to the count. Altogether, this means a much faster regression.
0.12.0¶
- removes include_likelihood from CoxPHFitter.fit – it was not slowing things down much (empirically), and often I wanted it for debugging (I suppose others do too). It's also another exit condition, so we may exit from the NR iterations faster.
- added step_size param to CoxPHFitter.fit – the default is good, but for extremely large or small datasets this may want to be set manually.
- added a warning to CoxPHFitter to check for complete separation: https://stats.idre.ucla.edu/other/multpkg/faq/general/faqwhatiscompleteorquasicompleteseparationinlogisticprobitregressionandhowdowedealwiththem/
- Additional functionality to utils.survival_table_from_events to bin the index to make the resulting table more readable.
0.11.3¶
- No longer support matplotlib 1.X
- Adding times argument to CoxPHFitter's predict_survival_function and predict_cumulative_hazard to predict the estimates at, instead of the default times of observation or censorship.
- More accurate prediction methods for parametric univariate models.
0.11.2¶
- Changing license to vanilla MIT.
- Speed up NelsonAalenFitter.fit considerably.
0.11.1¶
- Python3 fix for CoxPHFitter.plot.
0.11.0¶
- fixes regression in KaplanMeierFitter.plot when using Seaborn and lifelines.
- introduce a new .plot function to a fitted CoxPHFitter instance. This plots the hazard coefficients and their confidence intervals.
- in all plot methods, the ix kwarg has been deprecated in favour of a new loc kwarg. This is to align with Pandas deprecating ix
0.10.1¶
- fix in internal normalization for CoxPHFitter predict methods.
0.10.0¶
- corrected bug that was returning the wrong baseline survival and hazard values in CoxPHFitter when normalize=True.
- removed normalize kwarg in CoxPHFitter. This was causing lots of confusion for users, and added code complexity. It's really nice to be able to remove it.
- correcting column name in CoxPHFitter.baseline_survival_
- CoxPHFitter.baseline_cumulative_hazard_ is always centered, to mimic R's basehaz API.
- new predict_log_partial_hazards to CoxPHFitter
0.9.4¶
- adding plot_loglogs to KaplanMeierFitter
- added a (correct) check to see if some columns in a dataset will cause convergence problems.
- removing flat argument in plot methods. It was causing confusion. To replicate it, one can set ci_force_lines=True and show_censors=True.
- adding strata keyword argument to CoxPHFitter on initialization (ex: CoxPHFitter(strata=['v1', 'v2'])). Why? Fitters initialized with strata can now be passed into k_fold_cross_validation, plus it makes unit testing strata fitters easier.
- If using strata in CoxPHFitter, access to strata-specific baseline hazards and survival functions are available (previously it was a blended value). Prediction also uses the specific baseline hazards/survivals.
- performance improvements in CoxPHFitter – should see at least a 10% speed improvement in fit.
0.9.2¶
- deprecates Pandas versions before 0.18.
- throw an error if no admissible pairs in the c-index calculation. Previously a NaN was returned.
0.9.1¶
- add two summary functions to Weibull and Exponential fitter, solves #224
0.9.0¶
- new prediction function in CoxPHFitter, predict_log_hazard_relative_to_mean, that mimics what R's predict.coxph does.
- removing the predict method in CoxPHFitter and AalenAdditiveFitter. This is because the choice of predict_median as a default was causing too much confusion, and no other natural choice as a default was available. All other predict_ methods remain.
- Default predict method in k_fold_cross_validation is now predict_expectation
0.8.1¶
- supports matplotlib 1.5.
- introduction of a param nn_cumulative_hazards in AalenAdditiveModel's __init__ (default True). This parameter will truncate negative cumulative hazards in prediction methods to 0.
- bug fixes including:
  - fixed issue where the while loop in _newton_rhaphson would break too early causing a variable not to be set properly.
  - scaling of smooth hazards in NelsonAalenFitter was off by a factor of 0.5.
0.8.0¶
- reorganized lifelines directories:
  - moved test files out of main directory.
  - moved utils.py into its own directory.
  - moved all estimators into the fitters directory.
- added an at_risk column to the output of group_survival_table_from_events and survival_table_from_events
- added sample size and power calculations for statistical tests. See lifelines.statistics.sample_size_necessary_under_cph and lifelines.statistics.power_under_cph.
- fixed a bug when using KaplanMeierFitter for left-censored data.
0.7.1¶
- addition of a l2 penalizer to CoxPHFitter.
- dropped the Fortran implementation in favour of an efficient Python version. Lifelines is pure python once again!
- addition of strata keyword argument to CoxPHFitter to allow for stratification of a single or set of categorical variables in your dataset.
- datetimes_to_durations now accepts a list as na_values, so multiple values can be checked.
- fixed a bug in datetimes_to_durations where fill_date was not properly being applied.
- Changed warning in datetimes_to_durations to be correct.
- refactor each fitter into its own submodule. For now, the tests are still in the same file. This will also not break the API.
0.7.0¶
- allow for multiple fitters to be passed into k_fold_cross_validation.
- statistical tests in lifelines.statistics now return a StatisticalResult object with properties like p_value, test_results, and summary.
- fixed a bug in how logrank statistical tests are performed. The covariance matrix was not being correctly calculated. This resulted in slightly different p-values.
- WeibullFitter, ExponentialFitter, KaplanMeierFitter and BreslowFlemingHarringtonFitter all have a conditional_time_to_event_ property that measures the median duration remaining until the death event, given survival up until time t.
0.6.1¶
- addition of median_ property to WeibullFitter and ExponentialFitter.
- WeibullFitter and ExponentialFitter will use integer timelines instead of floats provided by linspace. This is so if your work is to sum up the survival function (for expected values or something similar), it's more difficult to make a mistake.
0.6.0¶
- Inclusion of the univariate fitters WeibullFitter and ExponentialFitter.
- Removing BayesianFitter from lifelines.
- Added new penalization scheme to AalenAdditiveFitter. You can now add a smoothing penalizer that will try to keep subsequent values of a hazard curve close together. The penalizing coefficient is smoothing_penalizer.
- Changed penalizer keyword arg to coef_penalizer in AalenAdditiveFitter.
- new ridge_regression function in utils.py to perform linear regression with l2 penalizer terms.
- Matplotlib is no longer a mandatory dependency.
- .predict(time) method on univariate fitters can now accept a scalar (and returns a scalar) and an iterable (and returns a numpy array)
- In KaplanMeierFitter, epsilon has been renamed to precision.
0.5.1¶
- New API for CoxPHFitter and AalenAdditiveFitter: the default arguments for event_col and duration_col. duration_col is now mandatory, and event_col now accepts a column, or by default, None, which assumes all events are observed (non-censored).
- Fix statistical tests.
- Allow negative durations in Fitters.
- New API in survival_table_from_events: min_observations is replaced by birth_times (default None).
- New API in CoxPHFitter for summary: summary will return a dataframe with statistics, print_summary() will print the dataframe (plus some other statistics) in a pretty manner.
- Adding "At Risk" counts option to univariate fitter plot methods, .plot(at_risk_counts=True), and the function lifelines.plotting.add_at_risk_counts.
- Fix bug in Epanechnikov kernel.
0.5.0¶
- move testing to py.test
- refactor tests into smaller files
- make test_pairwise_logrank_test_with_identical_data_returns_inconclusive a better test
- add test for summary()
- Alternate metrics can be used for k_fold_cross_validation.
0.4.4¶
- Lots of improvements to numerical stability (but some things still need work)
- Additions to summary in CoxPHFitter.
- Make all prediction methods output a DataFrame
- Fixes bug in 1-d input not returning in CoxPHFitter
- Lots of new tests.
0.4.3¶
- refactoring of qth_survival_times: it can now accept an iterable (or a scalar still) of probabilities in the q argument, and will return a DataFrame with these as columns. If len(q)==1 and a single survival function is given, will return a scalar, not a DataFrame. Also some good speed improvements.
- KaplanMeierFitter and NelsonAalenFitter now have a _label property that is passed in during the fit.
- KaplanMeierFitter/NelsonAalenFitter's initial alpha value is overwritten if a new alpha value is passed in during the fit.
- New method for KaplanMeierFitter: conditional_time_to. Th