A gentle introduction to group sequential design

Introduction

This article gives a gentle mathematical and statistical introduction to group sequential design. We also provide relatively simple examples from the literature to explain clinical applications. Little programming is shown, but the source for the article contains all of the required code, with substantial commenting in the hope that users can understand how to implement the concepts developed here. Hopefully, the few mathematical and statistical concepts introduced will not discourage those wishing to understand some underlying concepts for group sequential design.

A group sequential design allows repeated analysis of an endpoint in a clinical trial so that the trial can be stopped early for a positive result, for futility, or for a safety issue. This approach can

  • limit exposure risk to patients and clinical trial investment past the time where known unacceptable safety risks have been established for the endpoint of interest,
  • limit investment in a trial where interim results suggest further evaluation for a positive efficacy finding is futile, or
  • accelerate the availability of a highly effective treatment by enabling early approval following an early positive finding.

Examples of outcomes that might be considered include:

  • a continuous outcome such as change from baseline at some fixed follow-up time in the HAM-D depression score,
  • an absolute difference or risk ratio for a response rate (e.g., in oncology) or failure rate for a binary (yes/no) outcome, and
  • a hazard ratio for a time-to-event outcome such as time-to-death or disease progression in an oncology trial or time until a cardiovascular event (death, myocardial infarction or unstable angina).

Examples of the above include:

  • a new treatment for major depression where an interim analysis of a continuous outcome stopped the trial for futility (Binneman et al. (2008)),
  • a new treatment for patients with unstable angina undergoing balloon angioplasty with a positive interim finding for a binary outcome of death, myocardial infarction or urgent repeat intervention within 30 days (The CAPTURE Investigators (1997)), and
  • a new treatment for patients with lung cancer based on a positive interim finding for time-to-death (Gandhi et al. (2018)).

Group sequential design framework

We assume

  • A two-arm clinical trial with a control and experimental group.
  • There are k analyses planned for some integer k > 1.
  • There is a natural parameter $\delta$ describing the underlying treatment difference, with an asymptotically normal and efficient estimate $\hat\delta_j$ that has variance $\sigma_j^2$ and corresponding statistical information $\mathcal{I}_j = 1/\sigma_j^2$ at analysis j = 1, 2, …, k. A positive value of $\delta$ favors experimental treatment and a negative value favors control. We assume a consistent estimate $\hat\sigma_j^2$ of $\sigma_j^2$, j = 1, 2, …, k.
  • The information fraction is defined as $t_j = \mathcal{I}_j/\mathcal{I}_k$ at analysis j = 1, …, k.
  • Correlations between estimates at different analyses are $\text{Corr}(\hat\delta_i,\hat\delta_j)=\sqrt{\mathcal{I}_i/\mathcal{I}_j}=\sqrt{t_i/t_j}$ for 1 ≤ i ≤ j ≤ k.
  • There is a test statistic $Z_j \approx \hat\delta_j/\hat\sigma_j$.

For a time-to-event outcome, δ would typically represent the logarithm of the hazard ratio for the control group versus the experimental group. For a difference in response rates, δ would represent the underlying difference in response rates. For a continuous outcome such as the HAM-D, we would examine the difference in change from baseline at a milestone time point (e.g., at 6 weeks as in Binneman et al. (2008)). For j = 1, …, k, the tests $Z_j$ are asymptotically multivariate normal with correlations as above; that is, for 1 ≤ i ≤ j ≤ k, $\text{Cov}(Z_i, Z_j) = \text{Corr}(\hat\delta_i, \hat\delta_j)$ and $E(Z_j)=\delta\sqrt{\mathcal{I}_j}.$

This multivariate asymptotic normal distribution for Z1, …, Zk is referred to as the canonical form by Jennison and Turnbull (2000) who have also summarized much of the surrounding literature.
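
As a small numerical illustration of the canonical form, the following base R sketch builds the information fractions $t_j = \mathcal{I}_j/\mathcal{I}_k$, the correlation matrix of $Z_1, \ldots, Z_k$, and the means $E(Z_j)=\delta\sqrt{\mathcal{I}_j}$. The information levels and the value of δ are hypothetical and chosen only for illustration; they do not come from any of the trials cited here.

```r
# Hypothetical statistical information at k = 3 analyses (illustrative only)
info <- c(25, 50, 100)
k <- length(info)

# Information fractions t_j = I_j / I_k
t_frac <- info / info[k]

# Corr(Z_i, Z_j) = sqrt(I_i / I_j) for i <= j; the matrix is symmetric
corr_z <- outer(info, info, function(a, b) sqrt(pmin(a, b) / pmax(a, b)))

# Means of the test statistics under a hypothetical treatment effect delta
delta <- 0.3
mean_z <- delta * sqrt(info)

print(t_frac)
print(round(corr_z, 3))
print(mean_z)
```

Because the correlation depends only on the ratio of information levels, the same matrix is obtained from the information fractions as $\sqrt{t_i/t_j}$.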

Bounds for testing

One-sided testing

We assume that the primary test is of the null hypothesis $H_0$: δ = 0 against the alternative $H_1$: δ = $\delta_1$ for a fixed effect size $\delta_1 > 0$, which represents a benefit of experimental treatment compared to control. We assume further that there is interest in stopping early if there is good evidence to reject one hypothesis in favor of the other. For i = 1, 2, …, k − 1, interim cutoffs $l_i < u_i$ are set; final cutoffs $l_k \le u_k$ are also set. For i = 1, 2, …, k, the trial is stopped at analysis i to reject $H_0$ if $l_j < Z_j < u_j$, j = 1, 2, …, i − 1, and $Z_i \ge u_i$. If the trial continues until stage i, $H_0$ is not rejected at stage i, and $Z_i \le l_i$, then $H_1$ is rejected in favor of $H_0$, i = 1, 2, …, k. Thus, 3k parameters define a group sequential design: $l_i$, $u_i$, and $\mathcal{I}_i$, i = 1, 2, …, k. Note that if $l_k < u_k$ there is the possibility of completing the trial without rejecting $H_0$ or $H_1$. We will often restrict $l_k = u_k$ so that one hypothesis is rejected.

We begin with a one-sided test. In this case there is no interest in stopping early for a lower bound and thus $l_i = -\infty$, i = 1, 2, …, k. The probability of first crossing an upper bound at analysis i, i = 1, 2, …, k, is

$$\alpha_i^+(\delta) = P_\delta\left\{\{Z_i \ge u_i\} \cap \bigcap_{j=1}^{i-1}\{Z_j < u_j\}\right\}.$$

The Type I error is the probability of ever crossing the upper bound when δ = 0. The value $\alpha_i^+(0)$ is commonly referred to as the amount of Type I error spent at analysis i, 1 ≤ i ≤ k. The total upper boundary crossing probability for a trial is denoted in this one-sided scenario by

$$\alpha^+(\delta) \equiv \sum_{i=1}^{k}\alpha^+_{i}(\delta)$$

and the total Type I error by $\alpha^+(0)$. Assuming $\alpha^+(0) = \alpha$, the design will be said to provide a one-sided group sequential test at level α.
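
To make the definition concrete, here is a minimal base R sketch that evaluates the Type I error spent at each analysis and the total for a design with k = 2 analyses. It borrows the efficacy bounds (2.2209 and 1.3047) and sample sizes (59 and 134) from the major depression example later in this article, and it uses the fact that, under δ = 0, $Z_2$ given $Z_1 = z$ is normal with mean $\rho z$ and variance $1-\rho^2$, where ρ is the correlation between the two tests. The numerical integration with `integrate()` is just one convenient way to do this for two analyses; it is not how gsDesign computes bounds.

```r
# Efficacy bounds and sample sizes from the HAM-D example later in the article
u <- c(2.2209, 1.3047)   # u_1, u_2
rho <- sqrt(59 / 134)    # Corr(Z_1, Z_2) = sqrt(I_1 / I_2)

# Type I error spent at analysis 1: P(Z_1 >= u_1) when delta = 0
alpha1 <- pnorm(u[1], lower.tail = FALSE)

# Type I error spent at analysis 2: P(Z_1 < u_1, Z_2 >= u_2) when delta = 0,
# integrating P(Z_2 >= u_2 | Z_1 = z) against the density of Z_1
integrand <- function(z) {
  dnorm(z) * pnorm((u[2] - rho * z) / sqrt(1 - rho^2), lower.tail = FALSE)
}
alpha2 <- integrate(integrand, lower = -Inf, upper = u[1])$value

# Total one-sided Type I error, ignoring any lower bound
alpha_total <- alpha1 + alpha2
print(round(c(alpha1 = alpha1, alpha2 = alpha2, total = alpha_total), 4))
```

The total should be close to the design level of 0.10 used in that example, up to rounding of the bounds.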

Asymmetric two-sided testing

With both lower and upper bounds for testing and any real value δ representing treatment effect we denote the probability of crossing the upper boundary at analysis i without previously crossing a bound by

$$\alpha_i(\delta) = P_\delta\left\{\{Z_i \ge u_i\} \cap \bigcap_{j=1}^{i-1}\{l_j < Z_j < u_j\}\right\},$$

i = 1, 2, …, k. The total probability of crossing an upper bound prior to crossing a lower bound is denoted by

$$\alpha(\delta)\equiv\sum_{i=1}^{k}\alpha_{i}(\delta).$$

Next, we consider analogous notation for the lower bound. For i = 1, 2, …, k denote the probability of crossing a lower bound at analysis i without previously crossing any bound by

$$\beta_i(\delta) = P_\delta\left\{\{Z_i \le l_i\} \cap \bigcap_{j=1}^{i-1}\{l_j < Z_j < u_j\}\right\}.$$

The total lower boundary crossing probability in this case is written as

$$\beta(\delta)=\sum_{i=1}^{k}\beta_{i}(\delta).$$

When a design has equal final bounds ($l_k = u_k$), $\beta(\delta_1)$ is the Type II error, which is equal to 1 minus the power of the design. In this case, $\beta_i(\delta_1)$ is referred to as the β-spending at analysis i, i = 1, …, k.
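
The boundary crossing probabilities above can be evaluated directly for a two-analysis design. The sketch below uses the bounds and sample sizes from the major depression example later in this article and assumes 1:1 randomization, so that statistical information for a difference in means is $n/(4\sigma^2)$ with n the total sample size. The helper `cross_prob()` is a hypothetical name introduced here for illustration; it relies on base R numerical integration rather than on gsDesign.

```r
# Two-analysis design from the HAM-D example later in the article
l <- c(-0.2304, 1.3047)  # futility bounds l_1, l_2
u <- c(2.2209, 1.3047)   # efficacy bounds u_1, u_2
n <- c(59, 134)          # total sample size at each analysis
sigma <- 7.5             # standard deviation of the outcome

# Assumption: equal randomization, so information is n / (4 * sigma^2)
info <- n / (4 * sigma^2)
rho <- sqrt(info[1] / info[2])

cross_prob <- function(delta) {
  theta <- delta * sqrt(info)  # E(Z_j) = delta * sqrt(I_j)
  # P(Z_2 >= u_2 | Z_1 = z) from the bivariate normal conditional distribution
  p_upper2 <- function(z) {
    pnorm((u[2] - theta[2] - rho * (z - theta[1])) / sqrt(1 - rho^2),
          lower.tail = FALSE)
  }
  # alpha_2(delta): continue at analysis 1, then cross u_2
  cont_upper <- integrate(function(z) dnorm(z - theta[1]) * p_upper2(z),
                          lower = l[1], upper = u[1])$value
  p_continue <- pnorm(u[1] - theta[1]) - pnorm(l[1] - theta[1])
  alpha_i <- c(pnorm(u[1] - theta[1], lower.tail = FALSE), cont_upper)
  # Since l_2 = u_2, everything that continues and is not above u_2 is below l_2
  beta_i <- c(pnorm(l[1] - theta[1]), p_continue - cont_upper)
  list(alpha_i = alpha_i, beta_i = beta_i,
       alpha = sum(alpha_i), beta = sum(beta_i))
}

h0 <- cross_prob(0)  # null hypothesis
h1 <- cross_prob(3)  # alternative hypothesis delta_1 = 3
print(round(c(alpha_h0 = h0$alpha, beta_h1 = h1$beta, power = h1$alpha), 4))
```

With equal final bounds, the upper and lower crossing probabilities sum to 1 for any δ, so power under $\delta_1$ is 1 minus the Type II error.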

Spending function design

Type I error is most often defined with $\alpha_i^+(0)$, i = 1, …, k. This is referred to as non-binding Type I error since any lower bound is ignored in the calculation. This means that if a trial continues in spite of a lower bound being crossed at an interim analysis, Type I error is still controlled at the design α-level. For Phase III trials used for approvals of new treatments, non-binding Type I error calculation is generally expected by regulators.

For any given 0 < α < 1 we define a non-decreasing α-spending function f(t; α) for t ≥ 0 with f(0; α) = 0 and, for t ≥ 1, f(t; α) = α. Letting $t_0 = 0$, we set $\alpha_j^+(0)$ for j = 1, …, k through the equation $\alpha_j^+(0) = f(t_j; \alpha) - f(t_{j-1}; \alpha)$. Assuming an asymmetric lower bound, we similarly use a β-spending function $g(t; \delta_1, \beta)$ to set β-spending at analysis j = 1, …, k as $\beta_j(\delta_1) = g(t_j; \delta_1, \beta) - g(t_{j-1}; \delta_1, \beta)$.

In the following example, the function Φ() represents the cumulative distribution function for the standard normal distribution (i.e., mean 0, standard deviation 1). The major depression study of Binneman et al. (2008) considered above used the Lan and DeMets (1983) spending function approximating an O’Brien-Fleming bound for a single interim analysis halfway through the trial with

$$f(t; \alpha) = 2\left( 1-\Phi\left( \frac{\Phi ^{-1}(1-\alpha/2)}{\sqrt{t}}\right) \right),$$

$$g(t; \delta_1, \beta) = 2\left( 1-\Phi\left( \frac{\Phi ^{-1}(1-\beta/2)}{\sqrt{t}}\right) \right).$$
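
As a quick check on the arithmetic, the Lan-DeMets spending function approximating O’Brien-Fleming bounds can be evaluated in a few lines of base R, using the upper quantile $\Phi^{-1}(1-\alpha/2)$ so that spending increases from 0 at t = 0 to the full α at t = 1. The helper names `sf_ldof()` and `spend()` are hypothetical and introduced only for this sketch; in gsDesign the corresponding spending function is supplied as `sfLDOF`.

```r
# Lan-DeMets spending function approximating O'Brien-Fleming bounds;
# `total` is the total error (alpha or beta) to be spent by t = 1
sf_ldof <- function(t, total) {
  t <- pmin(pmax(t, 0), 1)  # no spending before t = 0, constant for t >= 1
  2 * (1 - pnorm(qnorm(1 - total / 2) / sqrt(t)))
}

# Incremental spending f(t_j) - f(t_{j-1}) with t_0 = 0
spend <- function(t, total) diff(c(0, sf_ldof(t, total)))

alpha <- 0.1                 # one-sided Type I error
beta <- 0.17                 # Type II error
t_planned <- c(0.5, 1)       # planned information fractions
t_actual <- c(59 / 134, 1)   # information fractions with 59 of 134 at interim

print(round(spend(t_planned, alpha), 4))  # alpha-spending as planned
print(round(spend(t_planned, beta), 4))   # beta-spending as planned
print(round(spend(t_actual, alpha), 4))   # alpha-spending as observed
print(round(spend(t_actual, beta), 4))    # beta-spending as observed
```

The incremental spending always sums to the total error, whatever the interim timing, which is what allows bounds to be updated when the observed timing differs from the plan.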

library(gsDesign)
delta1 <- 3 # Treatment effect, alternate hypothesis
delta0 <- 0 # Treatment effect, null hypothesis
ratio <- 1 # Randomization ratio (experimental / control)
sd <- 7.5 # Standard deviation for change in HAM-D score
alpha <- 0.1 # 1-sided Type I error
beta <- 0.17 # Targeted Type II error (1 - targeted power)
k <- 2 # Number of planned analyses
test.type <- 4 # Asymmetric bound design with non-binding futility bound
timing <- .5 # information fraction at interim analyses
sfu <- sfLDOF # O'Brien-Fleming spending function for alpha-spending
sfupar <- 0 # Parameter for upper spending function
sfl <- sfLDOF # O'Brien-Fleming spending function for beta-spending
sflpar <- 0 # Parameter for lower spending function
delta <- 0 # Placeholder; not used since n.fix is provided to gsDesign()
endpoint <- "normal" # Endpoint type for the design
# Derive normal fixed design sample size
n <- nNormal(
  delta1 = delta1,
  delta0 = delta0,
  ratio = ratio,
  sd = sd,
  alpha = alpha,
  beta = beta
)
# Derive group sequential design based on parameters above
x <- gsDesign(
  k = k,
  test.type = test.type,
  alpha = alpha,
  beta = beta,
  timing = timing,
  sfu = sfu,
  sfupar = sfupar,
  sfl = sfl,
  sflpar = sflpar,
  delta = delta, # Not used since n.fix is provided
  delta1 = delta1,
  delta0 = delta0,
  endpoint = "normal",
  n.fix = n
)
# Convert sample size at each analysis to integer values
x <- toInteger(x)
#> toInteger: rounding done to nearest integer since ratio was not specified as postive integer .

The planned design used α = 0.1, one-sided, and Type II error 17% (83% power) with an interim analysis at 50% of the final planned observations. This leads to Type I α-spending of 0.02 and β-spending of 0.052 at the planned interim. An advantage of the spending function approach is that bounds can be adjusted when the number of observations at an analysis differs from the plan. The interim analysis actually included 59 observations rather than the planned 67, which resulted in an interim information fraction $t_1$ = 0.4403. With the Lan-DeMets spending function to approximate O’Brien-Fleming bounds this results in α-spending of 0.0132 (P(Cross) if delta=0 row in Efficacy column) and β-spending of 0.0386 (P(Cross) if delta=3 row in Futility column).

We note that the Z-values and 1-sided p-values in the table below correspond exactly, and either can be used to evaluate statistical significance for a trial result. The rows labeled ~delta at bound describe approximately what treatment difference is required to cross a bound; these should not be used for a formal evaluation of whether a bound has been crossed.

The O’Brien-Fleming spending function is generally felt to provide conservative bounds for stopping at an interim analysis. Most of the error spending is reserved for the final analysis in this example. The futility bound only required a small trend in the wrong direction to stop the trial; the observed nominal p-value of 0.77 was greater than the futility p-value bound of 0.59, so the futility bound was crossed and the trial was stopped.

Finally, we note that at the final analysis, the cumulative probability for P(Cross) if delta=0 is 0.0965, which is less than the planned α = 0.10. This probability represents α(0), which does not count crossing the efficacy bound at the final analysis after the futility bound has already been crossed at the interim analysis. The value of the non-binding Type I error is still $\alpha^+(0)$ = 0.10.

# Updated alpha is unchanged
alphau <- 0.1
# Updated sample size at each analysis
n.I <- c(59, 134)
# Updated number of analyses
ku <- length(n.I)
# Information fraction is used for spending
usTime <- n.I / x$n.I[x$k]
lsTime <- usTime
# Update design based on actual interim sample size and planned final sample size
xu <- gsDesign(
  k = ku,
  test.type = test.type,
  alpha = alphau,
  beta = x$beta,
  sfu = sfu,
  sfupar = sfupar,
  sfl = sfl,
  sflpar = sflpar,
  n.I = n.I,
  maxn.IPlan = x$n.I[x$k],
  delta = x$delta,
  delta1 = x$delta1,
  delta0 = x$delta0,
  endpoint = endpoint,
  n.fix = n,
  usTime = usTime,
  lsTime = lsTime
)
# Summarize bounds
gsBoundSummary(xu, Nname = "N", digits = 4, ddigits = 2, tdigits = 1)
#>   Analysis               Value Efficacy Futility
#>  IA 1: 44%                   Z   2.2209  -0.2304
#>      N: 59         p (1-sided)   0.0132   0.5911
#>                ~delta at bound   4.3370  -0.4500
#>            P(Cross) if delta=0   0.0132   0.4089
#>            P(Cross) if delta=3   0.2468   0.0386
#>      Final                   Z   1.3047   1.3047
#>     N: 134         p (1-sided)   0.0960   0.0960
#>                ~delta at bound   1.6907   1.6907
#>            P(Cross) if delta=0   0.0965   0.9035
#>            P(Cross) if delta=3   0.8350   0.1650
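
The relationships among the rows of this table can be reproduced with a few lines of base R. The sketch below converts the interim Z bounds to nominal one-sided p-values and to approximate treatment differences, and then checks the observed interim p-value of 0.77 against the futility bound. It assumes 1:1 randomization, so that the standard error of the difference in means is $2\sigma/\sqrt{n}$ with n the total sample size; the variable names are hypothetical and used only for illustration.

```r
sigma <- 7.5      # standard deviation for change in HAM-D score
n_interim <- 59   # observations at the interim analysis
z_bound <- c(efficacy = 2.2209, futility = -0.2304)

# One-sided nominal p-values corresponding to the Z bounds
p_bound <- pnorm(z_bound, lower.tail = FALSE)

# Approximate treatment difference at each bound, assuming 1:1 randomization
# so that the standard error of the difference is 2 * sigma / sqrt(n)
delta_bound <- z_bound * 2 * sigma / sqrt(n_interim)

# Observed interim result: nominal one-sided p-value of 0.77
p_obs <- 0.77
z_obs <- qnorm(p_obs, lower.tail = FALSE)
stop_for_futility <- z_obs <= z_bound[["futility"]]

print(round(p_bound, 4))
print(round(delta_bound, 4))
print(stop_for_futility)
```

Comparing Z-values against the Z bounds and comparing nominal p-values against the p-value bounds lead to the same decision, since one is a monotone transformation of the other.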

References

Binneman, Brendon, Douglas Feltner, Sheela Kolluri, Yuanjun Shi, Ruolun Qiu, and Thomas Stiger. 2008. “A 6-Week Randomized, Placebo-Controlled Trial of CP-316,311 (a Selective CRH1 Antagonist) in the Treatment of Major Depression.” American Journal of Psychiatry 165 (5): 617–20.
Gandhi, Leena, Delvys Rodríguez-Abreu, Shirish Gadgeel, Emilio Esteban, Enriqueta Felip, Flávia De Angelis, Manuel Domine, et al. 2018. “Pembrolizumab Plus Chemotherapy in Metastatic Non–Small-Cell Lung Cancer.” New England Journal of Medicine 378 (22): 2078–92.
Jennison, Christopher, and Bruce W. Turnbull. 2000. Group Sequential Methods with Applications to Clinical Trials. Boca Raton, FL: Chapman & Hall/CRC.
Lan, K. K. G., and David L. DeMets. 1983. “Discrete Sequential Boundaries for Clinical Trials.” Biometrika 70: 659–63.
The CAPTURE Investigators. 1997. “Randomized Placebo-Controlled Trial of Abciximab Before and During Coronary Intervention in Refractory Unstable Angina: The CAPTURE Study.” Lancet 349 (9063): 1429–35.