Survival Analysis 101: An Easy Start Guide to Analyzing Time-to-Event Data

Corresponding Author: Quin E. Denfeld, PhD, RN, FAAN, FAHA, FHFSA, Oregon Health & Science University School of Nursing, 3455 S.W. U.S. Veterans Hospital Road | Mail code: SN-ORD, Portland, OR, USA 97239-2941, denfeldq@ohsu.edu, Twitter: @quin_denfeld

The publisher's final edited version of this article is available at Eur J Cardiovasc Nurs

Abstract

Survival analysis, also called time-to-event analysis, is a common approach to handling event data in cardiovascular nursing and health-related research. Survival analysis is used to describe, explain, and/or predict the occurrence and timing of events. It uses a specific language and methods designed to handle the unique nature of event data. In this methods paper, we provide an “easy start guide” to using survival analysis by a) providing a step-by-step guide, and b) applying the steps with example data. Specifically, we analyze cardiovascular event data over 6 months in a sample of patients with heart failure.

Introduction

Survival analysis, also called time-to-event or event history analysis, is a long-standing approach to analyzing the time to a particular event or outcome. 1 There are many examples of survival analysis applied to cardiovascular nursing and health-related research questions. 2–4 Despite the common use of survival analysis, there are misconceptions about when to use it and how to use it appropriately. Moreover, compared with more commonly used regression models (e.g. logistic regression), there is confusion about how to handle time in the analysis. Herein, we provide an “easy start guide” to using survival analysis by first providing a step-by-step guide and then applying the steps with example data.

Key Features of Survival Analysis

Survival analysis is used to describe, explain, and/or predict the occurrence and timing of events. 1 Survival analysis is unique because it simultaneously considers if events happened (i.e. a binary outcome) and when events happened (i.e. a continuous outcome). As such, survival analysis is best applied to research questions concerning both the occurrence and timing of events. Combining these two important elements of survival is not straightforward, however, because of two major issues: 1) how best to incorporate time, and 2) how best to handle participants who do not have events.

A common alternative to survival analysis for handling event data is logistic regression (i.e. modeling whether or not a participant had the event). The main limitations of using logistic regression with event data are the inability to comment on the timing of events or to compare time-to-event between groups (e.g. younger vs. older patients) or by a predictor (e.g. ejection fraction). Oftentimes, it is helpful to know how event risk changes over the course of the study period: when does event risk increase or decrease? Another alternative is to perform linear regression with time as the dependent variable, examining factors that increase or decrease time to an event; but there are numerous issues with this approach. First, time most often has a skewed distribution, which violates the assumptions of linear regression. Second, some participants will not have time data because they did not experience the event. This is called right censoring: we cannot “see” the event because it did not happen during the study period.
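The problem right censoring poses for a naive time-based analysis can be illustrated in a few lines of Python (hypothetical toy data; the participant IDs and times are ours for illustration):

```python
# Toy illustration of right censoring (hypothetical follow-up data).
# Each participant has a follow-up time in days and a flag for whether
# the event was observed (1) or the time was right-censored (0).
follow_up = [
    ("P01", 30, 1),   # event observed on day 30
    ("P02", 180, 0),  # reached end of study without an event
    ("P03", 95, 1),   # event observed on day 95
    ("P04", 120, 0),  # lost to follow-up on day 120 (censored)
]

event_times = [t for _, t, e in follow_up if e == 1]
censored_times = [t for _, t, e in follow_up if e == 0]

# A linear regression on time would have to drop or mishandle the
# censored participants; survival analysis uses both lists by design.
print(event_times)     # observed event times
print(censored_times)  # right-censored times
```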

Survival analysis approaches address the unique nature of time-to-event data and censoring by design, and they are easy to use with standard software packages. In this paper, we use survival analysis commands in Stata 5,6 (College Station, Texas, USA), but there are other programs, both commercial 7 and open source. 8 Because survival analysis is rooted in a distinct language, we have outlined some of the key terms in Box 1.

Box 1:

Common Survival Analysis Terms

Event (also called a Failure):

A well-defined, clear, unambiguous change in which there are two mutually exclusive states that a participant can be assigned (e.g. alive or dead)

Censoring:

Incomplete (or unobserved) event time (e.g. the event does not happen during the study time period)

Truncation:

Using event occurrence (or non-occurrence) for participant selection (e.g. only including those who had an event)

Survivor Function:

Probability of surviving past a particular time

Hazard Rate:

Measure of risk of an event during a given time period

Hazard Ratio:

Ratio of two hazard rates (e.g. for two different groups)

Step-by-Step Approach

Step 1.

The first step is to identify the event(s) of interest. This is an important and often under-described aspect of survival analysis and will help you determine whether a person had an event or not. Regardless of whether you are identifying events a priori (i.e. a prospective study) or post-hoc (i.e. a retrospective analysis), your selection of events should be based on the literature and clinical relevance. These events need to be mutually exclusive; that is, a participant either has the event or not. Often, these events are clearly demarcated: alive vs. dead or hospitalized vs. not hospitalized. However, events may not be clearly demarcated (e.g. becoming symptomatic); in this case, a “threshold” is needed to differentiate event vs. no event. It is also beneficial to define clearly what constitutes an event and to adjudicate all of the endpoints to ensure consistent, transparent, and defensible findings. We encourage investigators to define outcomes at multiple levels of specificity (e.g. delineating causes of hospitalization as all-cause, cardiovascular, or heart failure-specific). Finally, a composite of events is used frequently in survival analysis (e.g. all-cause mortality, hospitalization, or emergency department visit). It is important to note that basic survival analysis handles composite event data by considering which event came first, irrespective of which event is more severe and/or terminal.
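As a language-agnostic sketch of how a composite outcome reduces to the first event, consider the following Python snippet (the function and event labels are hypothetical, ours for illustration):

```python
# Sketch: basic survival analysis uses whichever component of a
# composite outcome occurred *first*, regardless of severity
# (e.g. an ED visit before a death counts as the first event).
def first_event(records):
    """records: list of (event_type, day); return the earliest, or None."""
    return min(records, key=lambda r: r[1]) if records else None

participant = [("hospitalization", 120), ("ed_visit", 45), ("death", 160)]
print(first_event(participant))  # the ED visit on day 45 comes first
```

Note that the later, more severe death on day 160 does not enter the basic analysis at all; competing risks methods (Step 10) exist for exactly this situation.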

Step 2.

The next step is to determine the duration a participant is at risk for the event. This will help you to determine when a person could have an event over the course of follow-up. If you are designing a prospective study, the duration of follow-up will depend on multiple factors, including the frequency of events and the resources you have available for follow-up. The frequency of events will also aid in a power analysis. For example, in heart failure, past literature has demonstrated that the event rate for cardiovascular events (death, hospitalization, emergency room admission) is about 40–50% over one year. 9,10 That means we can expect that, over one year, 40–50% of participants will have an event. If your event rate is infrequent (e.g. implantation of a rare device), then you may need a longer duration of follow-up in order to detect enough events.
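As a rough, hypothetical back-of-the-envelope sketch of how the event rate informs planning (this is not a formal power analysis, which should account for the planned model and effect sizes):

```python
# Given an anticipated event rate over the follow-up period, how many
# participants are needed to observe a target number of events?
# (Hypothetical target of 100 events; rates from the 40-50% example.)
import math

def n_needed(target_events, event_rate):
    return math.ceil(target_events / event_rate)

print(n_needed(100, 0.40))  # participants needed at a 40% event rate
print(n_needed(100, 0.50))  # participants needed at a 50% event rate
```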

Step 3.

Then you will determine the scale for the event data. There are generally two approaches, which will dictate the type of survival analysis method. The first approach is to collect event data using a granular/fine time scale, which usually means days (but could be more granular). The second approach is to collect event data using a coarser time scale, which usually means larger time frames such as months or years – this is called interval censoring. With interval censoring, there are a lot of “ties,” meaning that multiple participants have the event in the same time frame. The conditions under which interval-censored analysis (also called discrete time analysis) is preferred over continuous time analysis are described elsewhere. 11,12
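A small Python sketch (hypothetical event days; the 30-day month is a simplifying assumption) shows how a coarser time scale creates ties:

```python
# Collapsing exact event days into coarse monthly intervals produces
# "ties": several participants share the same event interval.
from collections import Counter

event_days = [3, 17, 29, 31, 45, 58, 61, 62, 88]
DAYS_PER_MONTH = 30  # simplifying assumption for illustration

months = [day // DAYS_PER_MONTH + 1 for day in event_days]  # month 1, 2, ...
ties = Counter(months)
print(dict(ties))  # events per month: three ties in each interval
```

On the fine (daily) scale every event time above is unique; on the monthly scale, each interval contains three tied events, which is what pushes the analysis toward interval-censored (discrete time) methods.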

Step 4.

The next step is to organize your data clearly to facilitate data analysis. Beyond the participant ID and variables of interest, you will only need two variables to conduct a survival analysis. First, you need a failure variable that indicates whether the participant had the event, usually no (0) vs. yes (1). Second, you need a follow-up duration variable, which is the time between the start of the study and the first event or, for participants who did not have an event, the time between the start and the end of the study. In other words, this is the duration that a participant was “at risk” for the event.
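A minimal Python sketch of deriving these two variables from raw dates (the study dates are hypothetical, and for simplicity we assume censoring occurs only at the end of the study, not at earlier loss to follow-up):

```python
# Derive the two variables survival analysis needs: a 0/1 failure
# indicator and a follow-up duration in days. Dates are hypothetical.
from datetime import date

STUDY_START = date(2023, 1, 1)
STUDY_END = date(2023, 6, 30)

def survival_record(event_date):
    """event_date: date of first event, or None if no event occurred."""
    if event_date is not None:
        return 1, (event_date - STUDY_START).days  # failure, days at risk
    return 0, (STUDY_END - STUDY_START).days       # censored at study end

print(survival_record(date(2023, 3, 1)))  # participant with an event
print(survival_record(None))              # censored participant
```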

Step 5.

Before analyzing, you will need to declare your data to be survival data. In Stata, this is performed with the stset command. 6 At a minimum, you will need to have a failure variable and a time variable, which are created in Step 4. It is important to note that every survival analysis command thereafter is linked back to this declaration. Thus, if you decide to examine a different outcome and/or time frame, then you need a new survival data declaration.

Step 6.

The next step is to examine events and life tables, similar to performing standard descriptive statistics. The stdescribe command in Stata 6 will provide a summary of the number of subjects, the entry and exit times, the total time at risk, and the number of failures ( Figure 1A ). You also can review life tables (a term borrowed from actuarial science), which show the number of participants at risk during each time interval (e.g. day 1 to day 2, and so forth) and how many “died” (i.e. had an event) in that same interval ( Figure 1B ).
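The life-table logic can be sketched in a few lines of Python (toy data, not the study data; Stata's ltable command produces a fuller table with rates and confidence intervals):

```python
# Minimal life-table sketch: for each day on which an event occurred,
# count how many participants were still at risk and how many "died".
def life_table(times, events):
    """times[i]: follow-up day; events[i]: 1 = event, 0 = censored."""
    table = []
    for day in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for t in times if t >= day)
        died = sum(1 for t, e in zip(times, events) if t == day and e == 1)
        table.append((day, at_risk, died))
    return table

# Six toy participants: events on days 2, 3, 3, 8; censored on days 5, 8.
times = [2, 3, 3, 5, 8, 8]
events = [1, 1, 1, 0, 1, 0]
print(life_table(times, events))  # (day, number at risk, number of events)
```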

Figure 1:

Description of survival data and life table.

Example of the (A) overall description of survival data using the stdescribe command and (B) a life table (truncated) using the ltable command.

Step 7.

The next step is to examine the survival or failure rates over time; typically, this is performed using the Kaplan-Meier estimator. 13 The Kaplan-Meier estimator estimates the unadjusted probability of surviving beyond a certain time point. In Stata, this is performed using the sts list and sts graph commands. 6 From this estimation, you can generate graphs of either the survivor function (i.e. how many participants have “survived,” or not had the event, over time) or the cumulative hazard function (i.e. the cumulative hazard rate over time). These graphs ( Figure 2A or B ) can then be used to illustrate the rate of events and censoring over the period of study.
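For readers who want to see the arithmetic behind the estimator, the product-limit calculation can be sketched in Python (toy data; the standard convention that events at a given time precede censorings at that time is reflected in the at-risk count):

```python
# Minimal Kaplan-Meier product-limit sketch. At each event time the
# running survival probability is multiplied by the fraction of
# at-risk participants who did NOT have the event at that time.
def kaplan_meier(times, events):
    """Return [(event time, survival probability)] pairs."""
    surv, curve = 1.0, []
    for day in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for t in times if t >= day)
        died = sum(1 for t, e in zip(times, events) if t == day and e == 1)
        surv *= (at_risk - died) / at_risk
        curve.append((day, surv))
    return curve

times = [2, 3, 3, 5, 8, 8]
events = [1, 1, 1, 0, 1, 0]
for day, s in kaplan_meier(times, events):
    print(day, round(s, 3))
```

Note that the censored participants (days 5 and 8 above) never trigger a drop in the curve, but they do shrink the at-risk denominator for later event times, which is exactly how censoring is "used" rather than discarded.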

Figure 2:

Different types of Kaplan-Meier graphs.

Kaplan-Meier graphs of (A) survivor function, and (B) cumulative hazard (including number at risk every 50 days). Abbreviations: CI, confidence interval

Step 8.

Next, determine which variables you want to test as predictors of your event variable and then perform simple comparative statistics. The simplest approach is to select a time-independent variable (i.e. one that does not vary over time). Typically, this is a variable that is measured at the origin time and either normally does not change (e.g. sex) or could change but is measured only once (e.g. ejection fraction). The log rank test is a simple test that compares the survivor functions of two or more groups and tests the null hypothesis that there is no difference in the probability of an event at any time point. 14 You also can examine time-dependent variables (e.g. ejection fraction measured repeatedly), but that requires additional steps. 15
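The mechanics of the log rank test can be sketched in Python for the two-group case (toy data; the statistic is referred to a chi-square distribution with one degree of freedom, for which 3.84 is the 5% critical value):

```python
# Two-group log rank sketch: at each event time, compare the observed
# events in group 1 with the number expected if both groups shared a
# single survivor function, then form a chi-square statistic.
def logrank_statistic(times1, events1, times2, events2):
    obs1 = exp1 = var = 0.0
    pooled = list(zip(times1 + times2, events1 + events2))
    for day in sorted({t for t, e in pooled if e == 1}):
        n1 = sum(1 for t in times1 if t >= day)            # at risk, group 1
        n2 = sum(1 for t in times2 if t >= day)            # at risk, group 2
        d1 = sum(1 for t, e in zip(times1, events1) if t == day and e == 1)
        d2 = sum(1 for t, e in zip(times2, events2) if t == day and e == 1)
        n, d = n1 + n2, d1 + d2
        obs1 += d1
        exp1 += d * n1 / n                                 # expected events
        if n > 1:                                          # hypergeometric var
            var += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
    return (obs1 - exp1) ** 2 / var

# Hypothetical toy data: group 2 has uniformly later events.
stat = logrank_statistic([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
print(round(stat, 2))  # compare against 3.84 (chi-square, 1 df, p = 0.05)
```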

Step 9.

Then you can examine your variables of interest in a multivariable regression model, controlling for additional variables. While there are some parametric approaches 16 (e.g. models based on the Weibull distribution), the semi-parametric approach, the Cox proportional hazards model, 17 is most commonly used. The Cox proportional hazards model combines a baseline hazard function with a function of the predictors that an individual does or does not have. The model generates regression coefficients in the form of hazard ratios. The hazard ratios are assumed to be proportional, or constant, over time; 17,18 this assumption can be tested both visually (stphplot command) and numerically (estat phtest command). 6
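The link between a Cox model coefficient and the reported hazard ratio can be sketched in Python (the beta and standard error here are hypothetical, not taken from the example analysis):

```python
# A Cox model reports a coefficient (beta) on the log-hazard scale;
# exponentiating gives the hazard ratio, and exponentiating the
# endpoints of beta +/- 1.96*SE gives the 95% confidence interval.
import math

def hazard_ratio(beta, se):
    hr = math.exp(beta)
    lower = math.exp(beta - 1.96 * se)
    upper = math.exp(beta + 1.96 * se)
    return hr, lower, upper

# Hypothetical coefficient and standard error:
hr, lower, upper = hazard_ratio(beta=0.405, se=0.15)
print(round(hr, 2), round(lower, 2), round(upper, 2))
```

Because the interval is built on the log scale and then exponentiated, it is asymmetric around the hazard ratio, which is why published intervals such as those in Table 1 are not centered on the point estimate.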

Step 10.

The final step described here is to interpret the findings in aggregate, adjust the analysis as needed, and then report the findings. The hazard ratios are interpreted in terms of a one-unit change in the value of the variable, which is similar to interpreting odds ratios or risk ratios. For example, a hazard ratio of 1.5 would be interpreted as a 50% greater risk of the event happening within the specified time frame (assuming significance). Precision around hazard ratios should be reported as 95% confidence intervals. There are more advanced approaches to consider, such as competing risks analysis, 19 which can help account for the complexity of the timing of composite events. When reporting, it is helpful to present the findings in multiple formats, including tabular and graphical formats. Kaplan-Meier graphs are a well-recognized way to depict survivor or hazard functions graphically, with a few important elements to include: appropriate y-axis labels and scales, correct labeling of curves, numbers at risk, confidence intervals around the estimates, and significance results (e.g. the log rank test).
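The mapping from a hazard ratio to the "% greater risk" wording can be made explicit (assuming, as the model does, that the ratio is constant over time):

```python
# Translate a hazard ratio into a percent change in risk:
# HR > 1 means greater risk; HR < 1 means lower risk.
def percent_change(hazard_ratio):
    return (hazard_ratio - 1) * 100

print(percent_change(1.5))          # "50% greater risk"
print(round(percent_change(0.79)))  # negative: "21% lower risk"
```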

Example

To illustrate these steps, we analyzed combined follow-up data from three studies of patients with heart failure. 20–22 Our research questions were 1) how many cardiovascular events occur in patients with heart failure over six months? and 2) what is the influence of age, gender, New York Heart Association (NYHA) functional classification, and comorbidity burden on event risk? For suggested reporting, see Figure 3A – C and Table 1 .

Figure 3:

Six-month cardiovascular event rates by subgroups.

Kaplan-Meier survivor function estimate of cardiovascular events (A) by gender, (B) by New York Heart Association Functional Classification, and (C) by comorbidity burden (including number at risk every 50 days). Log rank tests are also reported for each comparison. Abbreviations: CI, confidence interval; NYHA, New York Heart Association.

Table 1:

Cox Proportional Hazards Multivariate Model for 6-Month Event Risk

Variable                          HR (95% CI)        p value
Age                               1.00 (0.99–1.01)   0.76
Women (vs. Men)                   0.79 (0.54–1.14)   0.21
NYHA III/IV (vs. I/II)            3.03 (1.96–4.70)   <0.001
Comorbidity Burden *
  Medium (3–4 comorbidities)      1.29 (0.87–1.94)   0.21
  High (5+ comorbidities)         1.75 (1.00–3.03)   0.048

Abbreviations: CI, confidence interval; HR, hazard ratio; NYHA, New York Heart Association Class

* Low comorbidity burden is the referent category

The event of interest was time to first cardiovascular event, a composite outcome that included all-cause death, cardiovascular hospitalization, or cardiovascular emergency department visit. The duration a participant was at risk for the event was six months. Event data were recorded in days (i.e. the day of the event or of right censoring). The sample included 403 participants (59% male; mean age 58.7±14.4 years). The majority of participants (58%) were NYHA class III/IV. The Charlson Comorbidity Index was categorized as low (1–2 comorbidities; 58%), medium (3–4 comorbidities; 33%), and high (5 or more comorbidities; 9%). A total of 118 cardiovascular events (29%) were reported, which included 10 deaths, 82 cardiovascular hospitalizations, and 26 cardiovascular emergency department visits. Each participant entered the study period at time zero, and the median exit time was 149.4 days (range = 1–185 days) ( Figure 1A – B ).

Kaplan-Meier Plots and Log Rank Tests.

The overall Kaplan-Meier estimates are presented in Figure 2A – B . There were no significant differences in time-to-event between women and men ( Figure 3A ). Participants classified as NYHA III/IV had a greater event risk than participants classified as NYHA I/II ( Figure 3B ). There was also a significant difference in time-to-event between comorbidity categories ( Figure 3C ).

Cox Proportional Hazards Analysis.

After including all four variables in an adjusted multivariable regression model, NYHA classification and comorbidity burden were significant, but age and gender were not ( Table 1 ). Specifically, those with NYHA Class III/IV were about two times more likely to experience an event than those with NYHA Class I/II, and those with high comorbidity burden were about 75% more likely to experience an event than those with low comorbidity burden. Finally, the proportional hazards assumption was not violated ( Figure 4 ). In conclusion, the median time for a cardiovascular event among patients with heart failure was around five months. Significant predictors of increased cardiovascular event risk were NYHA III/IV classification and high comorbidity burden.

Figure 4:

Proportional hazards graph.

Example of testing the proportional hazards assumption by plotting the curves for both New York Heart Association Functional Class groups. The assumption is not violated when the curves are parallel.

Conclusion

Survival analysis or time-to-event analysis is a unique and highly informative approach that, when used appropriately, can yield important information about the risk of an event over a specific time frame. Herein, we provided a starting point for future researchers considering time-to-event analysis, which is a very common technique in cardiovascular nursing research.

Figure 5:

Overview of the 10 steps for getting started with analyzing time-to-event data. Example graph and write-up are also provided. Abbreviations: CI, confidence interval; HF, heart failure; NYHA, New York Heart Association.

Sources of Funding

Data reported in this paper were generated from studies funded by the Office of Research on Women’s Health and the Eunice Kennedy Shriver National Institute of Child Health & Human Development of the NIH (K12HD043488 to QED and CSL) and the American Heart Association (award number 11BGIA7840062 to CSL). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or American Heart Association.

Footnotes

Declaration of Conflicting Interests

References

1. Allison PD. Survival Analysis. In: Hancock GR, Mueller RO, eds. The Reviewer’s Guide to Quantitative Methods in the Social Sciences . New York: Routledge; 2010, p413–425. [Google Scholar]

2. Vilela de Sousa T, Cavalcante A, Lima NX, et al. Cardiovascular risk factors in the elderly: a 10-year follow-up survival analysis . Eur J Cardiovasc Nurs 2022. doi: 10.1093/eurjcn/zvac040 [PubMed] [CrossRef] [Google Scholar]

3. Chan YK, Stickland N, Stewart S. An inevitable or modifiable trajectory towards heart failure in high-risk individuals: insights from the nurse-led intervention for less chronic heart failure (NIL-CHF) study . Eur J Cardiovasc Nurs 2022. doi: 10.1093/eurjcn/zvac036 [PubMed] [CrossRef] [Google Scholar]

4. Ogawa M, Satomi-Kobayashi S, Hamaguchi M, et al. Postoperative dysphagia as a predictor of functional decline and prognosis after undergoing cardiovascular surgery . Eur J Cardiovasc Nurs 2022. doi: 10.1093/eurjcn/zvac084 [PubMed] [CrossRef] [Google Scholar]

6. StataCorp. Stata Statistical Software: Release 17 . In. College Station, TX: StataCorp LLC; 2021. [Google Scholar]

7. IBM Corporation. SPSS Statistics Kaplan-Meier Survival Analysis. Version 29.0; 2022. [Google Scholar]

8. Therneau T. A package for survival analysis in R; 2022.

9. Lee CS, Gelow JM, Denfeld QE, et al. Physical and psychological symptom profiling and event-free survival in adults with moderate to advanced heart failure . J Cardiovasc Nurs 2014; 29 :315–323. doi: 10.1097/JCN.0b013e318285968a [PMC free article] [PubMed] [CrossRef] [Google Scholar]

10. Tsao CW, Aday AW, Almarzooq ZI, et al. Heart Disease and Stroke Statistics-2022 Update: A Report From the American Heart Association . Circulation 2022; 145 :e153–e639. doi: 10.1161/CIR.0000000000001052 [PubMed] [CrossRef] [Google Scholar]

11. Leffondre K, Touraine C, Helmer C, Joly P. Interval-censored time-to-event and competing risk with death: is the illness-death model more accurate than the Cox model? Int J Epidemiol 2013; 42 :1177–1186. doi: 10.1093/ije/dyt126 [PubMed] [CrossRef] [Google Scholar]

12. Suresh K, Severn C, Ghosh D. Survival prediction models: an introduction to discrete-time modeling . BMC Medical Research Methodology 2022; 22 :207. doi: 10.1186/s12874-022-01679-6 [PMC free article] [PubMed] [CrossRef] [Google Scholar]

13. Kaplan EL, Meier P. Nonparametric Estimation from Incomplete Observations . Journal of the American Statistical Association 1958; 53 :457–481. doi: 10.1080/01621459.1958.10501452 [CrossRef] [Google Scholar]

14. Bland JM, Altman DG. The logrank test . BMJ 2004; 328 :1073. doi: 10.1136/bmj.328.7447.1073 [PMC free article] [PubMed] [CrossRef] [Google Scholar]

15. Giolo SR, Krieger JE, Mansur AJ, Pereira AC. Survival analysis of patients with heart failure: implications of time-varying regression effects in modeling mortality . PLoS One 2012; 7 :e37392. doi: 10.1371/journal.pone.0037392 [PMC free article] [PubMed] [CrossRef] [Google Scholar]

16. Bradburn MJ, Clark TG, Love SB, Altman DG. Survival analysis part II: multivariate data analysis--an introduction to concepts and methods . Br J Cancer 2003; 89 :431–436. doi: 10.1038/sj.bjc.6601119 [PMC free article] [PubMed] [CrossRef] [Google Scholar]

17. Cox DR. Regression Models and Life-Tables . Journal of the Royal Statistical Society: Series B (Methodological) 1972; 34 :187–202. doi: 10.1111/j.2517-6161.1972.tb00899.x [CrossRef] [Google Scholar]

18. Hess KR. Graphical methods for assessing violations of the proportional hazards assumption in Cox regression . Stat Med 1995; 14 :1707–1723. doi: 10.1002/sim.4780141510 [PubMed] [CrossRef] [Google Scholar]

19. Austin PC, Lee DS, Fine JP. Introduction to the Analysis of Survival Data in the Presence of Competing Risks . Circulation 2016; 133 :601–609. doi: 10.1161/CIRCULATIONAHA.115.017719 [PMC free article] [PubMed] [CrossRef] [Google Scholar]

20. Lee CS, Mudd JO, Hiatt SO, et al. Trajectories of heart failure self-care management and changes in quality of life . European Journal of Cardiovascular Nursing 2015; 14 :486–494. doi: 10.1177/1474515114541730 [PubMed] [CrossRef] [Google Scholar]

21. Lee CS, Lyons KS, Gelow JM, et al. Validity and reliability of the European Heart Failure Self-care Behavior Scale among adults from the United States with symptomatic heart failure . European Journal of Cardiovascular Nursing 2013; 12 :214–218. doi: 10.1177/1474515112469316 [PMC free article] [PubMed] [CrossRef] [Google Scholar]

22. Denfeld QE, Purnell JQ, Lee CS, et al. Candidate biomarkers of physical frailty in heart failure: an exploratory cross-sectional study . Eur J Cardiovasc Nurs 2022. doi: 10.1093/eurjcn/zvac054 [PMC free article] [PubMed] [CrossRef] [Google Scholar]