Futility and harm bounds for overall survival monitoring

Introduction

When clinical trials include overall survival (OS) as a secondary or exploratory endpoint, regulators may recommend monitoring not only for early evidence of efficacy and futility but also for potential harm, that is, evidence that the experimental treatment may be worsening survival relative to control. This article demonstrates how the gsDesign package supports group sequential designs with three boundaries: an efficacy (upper) bound, a futility (lower) bound, and a harm bound, using test.type = 7 (binding) and test.type = 8 (non-binding).

Regulatory context: FDA guidance on OS monitoring in oncology

The FDA draft guidance Assessment of Overall Survival Evidence in Support of Accelerated Approval of Oncology Therapeutics (U.S. Food and Drug Administration 2024) describes expectations for monitoring OS in the context of trials that may receive accelerated approval based on surrogate endpoints. The guidance states that sponsors should specify pre-planned boundaries for interim OS monitoring, including criteria for stopping a trial early if there is evidence of a detrimental effect on OS. Key points include:

Sponsors should include a pre-specified statistical analysis plan for interim OS analyses, including the timing and number of interim looks.
At a minimum, the guidance expects monitoring for OS harm (i.e., a detrimental trend in overall survival) using pre-specified boundaries.
Separate from the harm boundary, the sponsor should establish a futility boundary to stop the trial if the experimental treatment is unlikely to demonstrate an OS benefit.
The statistical plan should describe the spending functions used for each boundary and how the overall Type I error and Type II error are controlled.

This motivates the design framework with test.type = 7 (binding futility and harm bounds) and test.type = 8 (non-binding futility and harm bounds), where three boundaries are simultaneously specified using spending functions.

The harm bound implemented in gsDesign is a new method: a straightforward extension of the widely used group sequential spending function framework. We believe this approach is understandable, useful, and flexible, but it has limitations, and other methods for monitoring potential harm may also be considered. In particular, the example presented here has higher mortality risk than many trials. With lower mortality risk, modifications of this approach or other approaches may be preferable.

Design framework overview

In a standard two-sided asymmetric group sequential design (test.type = 3 or 4), there are two boundaries:

Efficacy (upper) bound: Reject $H_0$ if the test statistic exceeds this boundary (evidence of treatment benefit).
Futility (lower) bound: Stop for futility if the test statistic falls below this boundary (insufficient evidence of treatment benefit).

The harm bound extension (test.type = 7 or 8) adds a third boundary:

Harm bound: Signal that the experimental treatment may be harming patients (evidence of a detrimental effect).

The harm bound lies below the futility bound. At each analysis, there are four possible outcomes:

Cross the efficacy bound (above): Stop for efficacy.
Between the efficacy and futility bounds: Continue the trial.
Cross the futility bound but not the harm bound (between futility and harm): Stop for futility.
Cross the harm bound (below): Stop for harm.

The harm bound is designed to be crossed when a small p-value favoring control is observed. That is, the harm bound flags evidence that the experimental treatment may be worsening survival, corresponding to a hazard ratio above 1 and a negative Z-value.

Design with non-binding bounds (`test.type = 8`)

We demonstrate a survival design using gsSurvCalendar() with test.type = 8 (non-binding futility and harm bounds). The scenario is based on a 1:1 randomized trial monitoring overall survival with:

Median control survival: 3 years (36 months), i.e., $\lambda_C = \log(2)/36$.
Target hazard ratio: HR = 0.75 (25% reduction in hazard).
Power: 90% ($\beta = 0.1$).
One-sided $\alpha$: 0.0125 (e.g., the OS component of a trial with multiplicity adjustment).
Enrollment: Uniform enrollment over 18 months.
Study duration: 5 years (60 months) with planned analyses at years 1, 2, 3, 4, and 5 from start of enrollment.
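
As a rough orientation for these inputs, the sketch below converts the control median to a monthly hazard rate and applies Schoenfeld's approximation for the number of events in a fixed (single-analysis) 1:1 design. This is a back-of-envelope check in base R under stated assumptions, not the Lachin-Foulkes calculation performed by gsSurvCalendar(); the variable names are illustrative.

```r
# Design inputs taken from the list above
median_control <- 36      # control median survival, months
hr             <- 0.75    # target hazard ratio
alpha          <- 0.0125  # one-sided Type I error
beta           <- 0.1     # Type II error (90% power)

# Exponential hazard rate implied by the control median
lambda_c <- log(2) / median_control

# Schoenfeld approximation for a fixed design with 1:1 randomization:
# events = 4 * (z_alpha + z_beta)^2 / log(hr)^2
events_fixed <- 4 * (qnorm(1 - alpha) + qnorm(1 - beta))^2 / log(hr)^2

lambda_c               # monthly control hazard
ceiling(events_fixed)  # approximate events for a fixed design
```

The fixed-design figure is roughly 600 events, somewhat below the 657 events reported for the five-analysis design later in this article, as would be expected when interim analyses are added.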

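The astar parameter controls the total spending for the harm bound under $H_0$. We set astar = 0.1, meaning the harm spending function allocates a total of 10% under $H_0$ across the analyses. The reported probability of a harm stop can be smaller than this total, since crossing probabilities are reported as mutually exclusive outcomes.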

Spending function specification

We specify:

Efficacy bound: Lan-DeMets O’Brien-Fleming (sfLDOF) spending function (conservative, spending little $\alpha$ at early analyses).
Futility bound: Hwang-Shih-DeCani (HSD) spending function with $\gamma = -2$ (moderate $\beta$-spending under $H_1$).
Harm bound: Lan-DeMets Pocock (sfLDPocock) spending function (spending under $H_0$ for detecting harm).

library(gsDesign)

x8 <- gsSurvCalendar(
  test.type = 8,
  alpha = 0.0125,
  beta = 0.1,
  astar = 0.1,
  calendarTime = c(12, 24, 36, 48, 60),
  sfu = sfLDOF,
  sfl = sfHSD, sflpar = -2,
  sfharm = sfLDPocock,
  lambdaC = log(2) / 36,
  hr = 0.75,
  R = 18,
  minfup = 42
)

Summary

The summary() method provides a concise description of the design:

cat(strwrap(summary(x8), width = 65), sep = "\n")
#> Asymmetric two-sided group sequential design with non-binding
#> futility and harm bounds, 5 analyses, time-to-event outcome with
#> sample size 1148 and 657 events required, 90 percent power, 1.25
#> percent (1-sided) Type I error to detect a hazard ratio of 0.75.
#> Enrollment and total study durations are assumed to be 18 and 60
#> months, respectively. Efficacy bounds derived using a Lan-DeMets
#> O'Brien-Fleming approximation spending function (no parameters).
#> Futility bounds derived using a Hwang-Shih-DeCani spending
#> function with gamma = -2. Harm bounds derived using a Lan-DeMets
#> Pocock approximation spending function.

Detailed boundary table

The gsBoundSummary() function produces a tabular summary with a column for each boundary. By default, B-value, Spending, CP, CP H1, and PP are excluded. At the first interim analysis, the efficacy bound is so extreme that it is effectively impossible to cross, and it appears as NA in the table. The harm and futility bounds are more moderate, allowing early stopping if there is evidence of harm or futility. The futility bound illustrates why bounds are often non-binding: it is not intended as a strict stopping rule, but as a signal that the trial may be unlikely to succeed if it continues. Crossing the harm bound is a stronger indication that the treatment may be harmful; the trial should at least be paused, with a recommendation to review the safety and other endpoint data.

gsBoundSummary(x8)
#> Method: LachinFoulkes 
#>     Analysis               Value    Harm Futility Efficacy
#>    IA 1: 11%                   Z -2.1121  -1.4408       NA
#>       N: 766         p (1-sided)  0.9827   0.9252       NA
#>   Events: 73        ~HR at bound  1.6434   1.4034       NA
#>    Month: 12    P(Cross) if HR=1  0.0173   0.0575       NA
#>              P(Cross) if HR=0.75  0.0004   0.0034       NA
#>    IA 2: 38%                   Z -1.7667   0.1212   3.8622
#>      N: 1148         p (1-sided)  0.9614   0.4518   0.0001
#>  Events: 253        ~HR at bound  1.2491   0.9849   0.6149
#>    Month: 24    P(Cross) if HR=1  0.0416   0.5138   0.0001
#>              P(Cross) if HR=0.75  0.0004   0.0177   0.0574
#>    IA 3: 63%                   Z -1.7256   1.0566   2.9347
#>      N: 1148         p (1-sided)  0.9578   0.1454   0.0017
#>  Events: 416        ~HR at bound  1.1846   0.9015   0.7497
#>    Month: 36    P(Cross) if HR=1  0.0417   0.8224   0.0017
#>              P(Cross) if HR=0.75  0.0004   0.0394   0.4990
#>    IA 4: 83%                   Z -1.7170   1.7357   2.5278
#>      N: 1148         p (1-sided)  0.9570   0.0413   0.0057
#>  Events: 548        ~HR at bound  1.1580   0.8622   0.8057
#>    Month: 48    P(Cross) if HR=1  0.0417   0.9214   0.0062
#>              P(Cross) if HR=0.75  0.0004   0.0670   0.7996
#>        Final                   Z -1.7149   2.3072   2.3072
#>      N: 1148         p (1-sided)  0.9568   0.0105   0.0105
#>  Events: 657        ~HR at bound  1.1433   0.8352   0.8352
#>    Month: 60    P(Cross) if HR=1  0.0417   0.9471   0.0112
#>              P(Cross) if HR=0.75  0.0004   0.0996   0.9000
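
The p (1-sided) and ~HR at bound rows can be approximately reproduced from the Z-values and event counts. The sketch below assumes 1:1 randomization and the Schoenfeld approximation log(HR) ≈ -2Z/sqrt(events); small differences from the table are possible because the displayed event counts are rounded. The helper name z_to_hr is hypothetical and is not part of gsDesign.

```r
# Approximate hazard ratio at a bound for 1:1 randomization
# (Schoenfeld approximation; z_to_hr is an illustrative helper name)
z_to_hr <- function(z, events) exp(-2 * z / sqrt(events))

# Final analysis Z-values from the table above (657 events)
z_final <- c(harm = -1.7149, efficacy = 2.3072)

# Nominal one-sided p-value: upper tail of the standard normal
p_one_sided <- pnorm(z_final, lower.tail = FALSE)

# Approximate hazard ratio at each bound
hr_at_bound <- z_to_hr(z_final, events = 657)

round(p_one_sided, 4)
round(hr_at_bound, 4)
```

A harm bound with a negative Z-value maps to a hazard ratio above 1, while the efficacy bound maps to a hazard ratio below 1.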

Conditional power (CP, CP H1) and predictive power (PP) can also be included in the summary. Below we show the full table with all statistics, including conditional and predictive power at each boundary:

gsBoundSummary(x8, exclude = c())
#> Method: LachinFoulkes 
#>     Analysis               Value    Harm Futility Efficacy
#>    IA 1: 11%                   Z -2.1121  -1.4408       NA
#>       N: 766         p (1-sided)  0.9827   0.9252       NA
#>   Events: 73        ~HR at bound  1.6434   1.4034       NA
#>    Month: 12            Spending  0.0173   0.0039       NA
#>                          B-value -0.7011  -0.4782       NA
#>                               CP  0.0000   0.0000       NA
#>                            CP H1  0.4619   0.5942       NA
#>                               PP  0.0011   0.0097       NA
#>                 P(Cross) if HR=1  0.0173   0.0575       NA
#>              P(Cross) if HR=0.75  0.0004   0.0034       NA
#>    IA 2: 38%                   Z -1.7667   0.1212   3.8622
#>      N: 1148         p (1-sided)  0.9614   0.4518   0.0001
#>  Events: 253        ~HR at bound  1.2491   0.9849   0.6149
#>    Month: 24            Spending  0.0334   0.0143   0.0001
#>                          B-value -1.0954   0.0751   2.3947
#>                               CP  0.0000   0.0024   1.0000
#>                            CP H1  0.0097   0.4033   0.9994
#>                               PP  0.0000   0.0358   0.9994
#>                 P(Cross) if HR=1  0.0416   0.5138   0.0001
#>              P(Cross) if HR=0.75  0.0004   0.0177   0.0574
#>    IA 3: 63%                   Z -1.7256   1.0566   2.9347
#>      N: 1148         p (1-sided)  0.9578   0.1454   0.0017
#>  Events: 416        ~HR at bound  1.1846   0.9015   0.7497
#>    Month: 36            Spending  0.0229   0.0217   0.0016
#>                          B-value -1.3725   0.8404   2.3343
#>                               CP  0.0000   0.0396   0.9928
#>                            CP H1  0.0000   0.3449   0.9928
#>                               PP  0.0000   0.0776   0.9759
#>                 P(Cross) if HR=1  0.0417   0.8224   0.0017
#>              P(Cross) if HR=0.75  0.0004   0.0394   0.4990
#>    IA 4: 83%                   Z -1.7170   1.7357   2.5278
#>      N: 1148         p (1-sided)  0.9570   0.0413   0.0057
#>  Events: 548        ~HR at bound  1.1580   0.8622   0.8057
#>    Month: 48            Spending  0.0154   0.0277   0.0046
#>                          B-value -1.5689   1.5860   2.3098
#>                               CP  0.0000   0.1578   0.8708
#>                            CP H1  0.0000   0.3906   0.9337
#>                               PP  0.0000   0.1793   0.8485
#>                 P(Cross) if HR=1  0.0417   0.9214   0.0062
#>              P(Cross) if HR=0.75  0.0004   0.0670   0.7996
#>        Final                   Z -1.7149   2.3072   2.3072
#>      N: 1148         p (1-sided)  0.9568   0.0105   0.0105
#>  Events: 657        ~HR at bound  1.1433   0.8352   0.8352
#>    Month: 60            Spending  0.0110   0.0325   0.0062
#>                          B-value -1.7149   2.3072   2.3072
#>                 P(Cross) if HR=1  0.0417   0.9471   0.0112
#>              P(Cross) if HR=0.75  0.0004   0.0996   0.9000
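
The Harm column of the Spending row can be approximately reconstructed by hand. The sketch below applies the Lan-DeMets Pocock-like spending formula, astar * log(1 + (e - 1) * t), to information fractions derived from the rounded event counts in the table. It uses base R only and assumes that information is proportional to events; agreement with the table is therefore approximate.

```r
# Event counts at each analysis (rounded, from the table above)
events <- c(73, 253, 416, 548, 657)
info_frac <- events / max(events)    # approximate information fractions

astar <- 0.1                         # total harm spending under H0

# Lan-DeMets Pocock-like cumulative spending: astar * log(1 + (e - 1) * t)
cum_spend <- astar * log(1 + (exp(1) - 1) * info_frac)

# Incremental spending at each analysis
incr_spend <- diff(c(0, cum_spend))

round(data.frame(info_frac, cum_spend, incr_spend), 4)
```

The incremental values sum to astar and are largest at the early analyses, which is the front-loaded behavior that motivates using this spending function for harm monitoring.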

Interpreting the boundaries

The design has five analyses at calendar times of 12, 24, 36, 48, and 60 months. At each analysis, the test statistic (Z-value) is compared against three boundaries:

library(knitr) # provides kable()

bounds <- data.frame(
  Analysis = 1:x8$k,
  Month = x8$T,
  Events = ceiling(x8$n.I),
  Harm = round(x8$harm$bound, 2),
  Futility = round(x8$lower$bound, 2),
  Efficacy = round(x8$upper$bound, 2)
)
kable(bounds, caption = "Z-value boundaries at each analysis")

Z-value boundaries at each analysis
Analysis	Month	Events	Harm	Futility	Efficacy
1	12	73	-2.11	-1.44	7.43
2	24	253	-1.77	0.12	3.86
3	36	416	-1.73	1.06	2.93
4	48	548	-1.72	1.74	2.53
5	60	657	-1.71	2.31	2.31

Decision rules at an analysis where all three bounds are active:

If $Z >$ efficacy bound: Stop for efficacy (reject $H_0$).
If futility bound $< Z \leq$ efficacy bound: Continue the trial.
If harm bound $< Z \leq$ futility bound: Stop for futility.
If $Z \leq$ harm bound: Stop for harm.
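
The decision rules above can be sketched as a small base R function. The helper classify_z is hypothetical and is not part of gsDesign; the bounds are the rounded analysis 3 values from the table above. For a non-binding design, the returned labels are recommendations rather than mandatory actions.

```r
# Classify a Z-value against the three bounds at one analysis.
# classify_z is an illustrative helper, not a gsDesign function.
classify_z <- function(z, harm, futility, efficacy) {
  stopifnot(harm <= futility, futility <= efficacy)
  if (z > efficacy) {
    "Stop for efficacy"
  } else if (z > futility) {
    "Continue"
  } else if (z > harm) {
    "Stop for futility"
  } else {
    "Stop for harm"
  }
}

# Rounded bounds at analysis 3 (month 36) from the table above
decisions <- sapply(
  c(3.1, 1.5, 0.2, -2.0),
  classify_z,
  harm = -1.73, futility = 1.06, efficacy = 2.93
)
decisions
```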

When both lower bounds are active, the harm bound is always at or below the futility bound. If futility is skipped but harm is tested, the harm bound is the sole active lower stopping boundary. At early analyses, the harm and futility bounds may coincide when the harm spending function has not yet allocated sufficient spending to differentiate them.

Boundary crossing probabilities

We examine the operating characteristics under two scenarios: no treatment effect (HR = 1, i.e., under $H_0$) and the design alternative (HR = 0.75). When harm and futility are both active, x8$lower$prob and x8$harm$prob are reported as mutually exclusive stopping outcomes. Thus, the probability of crossing the futility threshold is the sum of the two lower-tail components.

probs <- data.frame(
  Scenario = c(rep("Under H0 (HR=1)", x8$k), rep("Under H1 (HR=0.75)", x8$k)),
  Analysis = rep(1:x8$k, 2),
  Month = rep(x8$T, 2),
  `P(Efficacy)` = c(cumsum(x8$upper$prob[, 1]), cumsum(x8$upper$prob[, 2])),
  `P(Futility only)` = c(cumsum(x8$lower$prob[, 1]), cumsum(x8$lower$prob[, 2])),
  `P(Harm)` = c(cumsum(x8$harm$prob[, 1]), cumsum(x8$harm$prob[, 2])),
  `P(Futility or Harm)` = c(
    cumsum(x8$lower$prob[, 1] + x8$harm$prob[, 1]),
    cumsum(x8$lower$prob[, 2] + x8$harm$prob[, 2])
  ),
  check.names = FALSE
)
kable(probs, digits = 4, caption = "Cumulative boundary crossing probabilities")

Cumulative boundary crossing probabilities
Scenario	Analysis	Month	P(Efficacy)	P(Futility only)	P(Harm)	P(Futility or Harm)
Under H0 (HR=1)	1	12	0.0000	0.0575	0.0173	0.0748
Under H0 (HR=1)	2	24	0.0001	0.5138	0.0416	0.5554
Under H0 (HR=1)	3	36	0.0017	0.8224	0.0417	0.8641
Under H0 (HR=1)	4	48	0.0062	0.9214	0.0417	0.9631
Under H0 (HR=1)	5	60	0.0112	0.9471	0.0417	0.9888
Under H1 (HR=0.75)	1	12	0.0000	0.0034	0.0004	0.0039
Under H1 (HR=0.75)	2	24	0.0574	0.0177	0.0004	0.0181
Under H1 (HR=0.75)	3	36	0.4990	0.0394	0.0004	0.0398
Under H1 (HR=0.75)	4	48	0.7996	0.0670	0.0004	0.0675
Under H1 (HR=0.75)	5	60	0.9000	0.0996	0.0004	0.1000

Under $H_0$, the cumulative probability of stopping for harm across all analyses is approximately 0.0417. This is below the total harm spending of astar = 0.1 because the reported outcomes are mutually exclusive: a trial that has already stopped for futility at an earlier analysis is not counted again as a harm stop. The cumulative probability of crossing the futility threshold, inclusive of harm, is approximately 0.9888. Under $H_1$ (HR = 0.75), crossing the harm bound is very unlikely (approximately $4 \times 10^{-4}$), since the treatment is beneficial.
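
As a quick consistency check on the table, the three mutually exclusive outcomes at the final analysis should sum to 1 under each scenario. The sketch below uses the rounded values from the table rather than the x8 object, so the sums are exact only up to rounding.

```r
# Cumulative probabilities at the final analysis, from the table above
final_probs <- rbind(
  H0 = c(efficacy = 0.0112, futility_only = 0.9471, harm = 0.0417),
  H1 = c(efficacy = 0.9000, futility_only = 0.0996, harm = 0.0004)
)

# Mutually exclusive outcomes should sum to 1 (up to rounding)
total <- rowSums(final_probs)
total

# Probability of crossing the futility threshold, inclusive of harm
futility_or_harm <- final_probs[, "futility_only"] + final_probs[, "harm"]
futility_or_harm
```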

Visualization

All standard plot() types are supported for test.type = 7 and 8 designs, with a third line (or set of lines) shown for the harm bound.

Z-value boundaries

The default plot shows Z-value boundaries at each analysis. Three boundaries are displayed: efficacy (upper), futility (lower), and harm (below futility).

plot(x8)

Z-value boundaries for non-binding harm bound design

Boundary crossing probabilities

The power plot (plottype = 2) shows cumulative boundary crossing probabilities as a function of the treatment effect. Three sets of lines appear: upper bound (cumulative efficacy crossing probability), 1-(Futility or harm), and 1-Harm. Because the harm boundary is nested below the futility boundary when both are active, crossing the futility threshold includes both futility-only and harm stops. The 1-(Futility or harm) curve therefore subtracts x8$lower$prob + x8$harm$prob, while the 1-Harm curve subtracts harm crossings only. This aggregation is performed only for plotting: the probability arrays stored in x8 remain mutually exclusive so that efficacy, futility-only, and harm outcomes add without double counting. When the underlying treatment effect favors control, the high probability of crossing the harm bound indicates that the harm bound is sensitive and serves its intended purpose.

plot(x8, plottype = 2)

Boundary crossing probabilities for non-binding harm bound design

Approximate treatment effect at boundaries

The effect size plot (plottype = 3) shows the approximate treatment effect at each boundary. For survival designs, this is expressed as the approximate hazard ratio at the boundary.

plot(x8, plottype = 3)

Approximate treatment effect at boundaries

Conditional power at boundaries

Conditional power (plottype = 4) at each interim analysis is shown for all three boundaries. This is generally not a very useful plot.

plot(x8, plottype = 4)

Conditional power at boundaries

Spending function plot

The spending function plot (plottype = 5) shows the three spending functions: $\alpha$ (efficacy), $\beta$ (futility), and harm.

plot(x8, plottype = 5)

Spending functions for non-binding harm bound design

B-values at boundaries

B-values (plottype = 7) are Z-values scaled by $\sqrt{t}$ where $t$ is the information fraction. As discussed by Proschan et al. (2006), the expected value of B-values increases linearly with the information fraction under the assumption of a constant treatment effect (proportional hazards). This linear relationship makes B-values useful for visual assessment of treatment effect trends across interim analyses: departures from linearity may suggest non-proportional hazards or other changes in treatment effect over time. Three boundary lines are shown: efficacy, futility, and harm.

plot(x8, plottype = 7)

B-values at boundaries
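
To make the scaling concrete, the sketch below converts the harm-bound Z-values to B-values using B(t) = Z(t) * sqrt(t). It uses base R and approximates the information fraction t by the ratio of rounded event counts, so the results may differ slightly from the B-value row reported by gsBoundSummary().

```r
# Harm-bound Z-values and rounded event counts from the tables above
z_harm <- c(-2.1121, -1.7667, -1.7256, -1.7170, -1.7149)
events <- c(73, 253, 416, 548, 657)

# Approximate information fraction at each analysis
info_frac <- events / max(events)

# B-value: Z-value scaled by the square root of the information fraction
b_harm <- z_harm * sqrt(info_frac)

round(b_harm, 3)
```

At the final analysis the information fraction is 1, so the B-value and the Z-value coincide.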

Design with binding bounds (`test.type = 7`)

For test.type = 7, both the futility and harm bounds are binding — meaning the computation of the efficacy bound assumes the trial will stop if either bound is crossed. This yields a slightly less conservative efficacy bound (easier to cross), but at the cost of inflated Type I error if the stopping rule is not strictly followed.

We first create a binding design with $\alpha = 0.0125$ to compare with the non-binding design above:

x7 <- gsSurvCalendar(
  test.type = 7,
  alpha = 0.0125,
  beta = 0.1,
  astar = 0.1,
  calendarTime = c(12, 24, 36, 48, 60),
  sfu = sfLDOF,
  sfl = sfHSD, sflpar = -2,
  sfharm = sfLDPocock,
  lambdaC = log(2) / 36,
  hr = 0.75,
  R = 18,
  minfup = 42
)

Comparing binding and non-binding

comparison <- data.frame(
  Bound = c("Efficacy", "Futility", "Harm"),
  `Binding (type 7)` = c(
    paste(round(x7$upper$bound, 3), collapse = ", "),
    paste(round(x7$lower$bound, 3), collapse = ", "),
    paste(round(x7$harm$bound, 3), collapse = ", ")
  ),
  `Non-binding (type 8)` = c(
    paste(round(x8$upper$bound, 3), collapse = ", "),
    paste(round(x8$lower$bound, 3), collapse = ", "),
    paste(round(x8$harm$bound, 3), collapse = ", ")
  ),
  check.names = FALSE
)
kable(comparison, caption = "Comparison of binding vs. non-binding Z-value boundaries")

Comparison of binding vs. non-binding Z-value boundaries
Bound	Binding (type 7)	Non-binding (type 8)
Efficacy	7.434, 3.862, 2.934, 2.523, 2.248	7.434, 3.862, 2.935, 2.528, 2.307
Futility	-1.458, 0.09, 1.016, 1.689, 2.248	-1.441, 0.121, 1.057, 1.736, 2.307
Harm	-2.112, -1.767, -1.726, -1.717, -1.715	-2.112, -1.767, -1.726, -1.717, -1.715

Note that the efficacy bounds for test.type = 7 (binding) are slightly lower (easier to cross) than for test.type = 8 (non-binding). The maximum number of events for test.type = 7 (639) is also slightly smaller than for test.type = 8 (657), reflecting the assumption that the trial will stop at the lower bounds.

gsBoundSummary(x7)
#> Method: LachinFoulkes 
#>     Analysis               Value    Harm Futility Efficacy
#>    IA 1: 11%                   Z -2.1121  -1.4578       NA
#>       N: 746         p (1-sided)  0.9827   0.9275       NA
#>   Events: 71        ~HR at bound  1.6550   1.4158       NA
#>    Month: 12    P(Cross) if HR=1  0.0173   0.0551       NA
#>              P(Cross) if HR=0.75  0.0005   0.0034       NA
#>    IA 2: 38%                   Z -1.7667   0.0895   3.8622
#>      N: 1118         p (1-sided)  0.9614   0.4643   0.0001
#>  Events: 246        ~HR at bound  1.2531   0.9886   0.6107
#>    Month: 24    P(Cross) if HR=1  0.0419   0.5011   0.0001
#>              P(Cross) if HR=0.75  0.0005   0.0176   0.0539
#>    IA 3: 63%                   Z -1.7256   1.0159   2.9344
#>      N: 1118         p (1-sided)  0.9578   0.1548   0.0017
#>  Events: 404        ~HR at bound  1.1874   0.9038   0.7467
#>    Month: 36    P(Cross) if HR=1  0.0421   0.8131   0.0017
#>              P(Cross) if HR=0.75  0.0005   0.0393   0.4829
#>    IA 4: 83%                   Z -1.7170   1.6890   2.5229
#>      N: 1118         p (1-sided)  0.9570   0.0456   0.0058
#>  Events: 533        ~HR at bound  1.1604   0.8639   0.8037
#>    Month: 48    P(Cross) if HR=1  0.0421   0.9172   0.0063
#>              P(Cross) if HR=0.75  0.0005   0.0670   0.7881
#>        Final                   Z -1.7149   2.2480   2.2480
#>      N: 1118         p (1-sided)  0.9568   0.0123   0.0123
#>  Events: 639        ~HR at bound  1.1454   0.8370   0.8370
#>    Month: 60    P(Cross) if HR=1  0.0421   0.9455   0.0125
#>              P(Cross) if HR=0.75  0.0005   0.0995   0.9000

Efficacy bounds at alternate $\alpha$ levels

The gsBoundSummary() function accepts an alpha argument to display efficacy bounds at one or more alternate $\alpha$ levels alongside the original design. Each alternate-alpha column retains the testUpper schedule from the design, so an efficacy analysis that was skipped remains inactive and every efficacy characteristic at that analysis, including cumulative crossing probability, is NA. Here we show the non-binding design (x8) with efficacy bounds for both $\alpha = 0.0125$ (the design level) and $\alpha = 0.025$:

gsBoundSummary(x8, alpha = 0.025)
#>     Analysis               Value α=0.0125 α=0.025 Futility    Harm
#>    IA 1: 11%                   Z       NA      NA  -1.4408 -2.1121
#>       N: 766         p (1-sided)       NA      NA   0.9252  0.9827
#>   Events: 73        ~HR at bound       NA      NA   1.4034  1.6434
#>    Month: 12    P(Cross) if HR=1       NA      NA   0.0575  0.0173
#>              P(Cross) if HR=0.75       NA      NA   0.0034  0.0004
#>    IA 2: 38%                   Z   3.8622  3.4312   0.1212 -1.7667
#>      N: 1148         p (1-sided)   0.0001  0.0003   0.4518  0.9614
#>  Events: 253        ~HR at bound   0.6149  0.6492   0.9849  1.2491
#>    Month: 24    P(Cross) if HR=1   0.0001  0.0003   0.5138  0.0416
#>              P(Cross) if HR=0.75   0.0574  0.1259   0.0177  0.0004
#>    IA 3: 63%                   Z   2.9347  2.5948   1.0566 -1.7256
#>      N: 1148         p (1-sided)   0.0017  0.0047   0.1454  0.9578
#>  Events: 416        ~HR at bound   0.7497  0.7751   0.9015  1.1846
#>    Month: 36    P(Cross) if HR=1   0.0017  0.0048   0.8224  0.0417
#>              P(Cross) if HR=0.75   0.4990  0.6323   0.0394  0.0004
#>    IA 4: 83%                   Z   2.5278  2.2359   1.7357 -1.7170
#>      N: 1148         p (1-sided)   0.0057  0.0127   0.0413  0.9570
#>  Events: 548        ~HR at bound   0.8057  0.8261   0.8622  1.1580
#>    Month: 48    P(Cross) if HR=1   0.0062  0.0138   0.9214  0.0417
#>              P(Cross) if HR=0.75   0.7996  0.8684   0.0670  0.0004
#>        Final                   Z   2.3072  2.0432   2.3072 -1.7149
#>      N: 1148         p (1-sided)   0.0105  0.0205   0.0105  0.9568
#>  Events: 657        ~HR at bound   0.8352  0.8526   0.8352  1.1433
#>    Month: 60    P(Cross) if HR=1   0.0112  0.0201   0.9471  0.0417
#>              P(Cross) if HR=0.75   0.9000  0.9218   0.0996  0.0004

Alternate-alpha summaries are supported for test.type = 8, where both futility and harm bounds are non-binding. They are not supported for test.type = 7: a binding harm or futility bound is outside the non-binding sequential-p-value framework used for Maurer–Bretz graphical multiple testing.

Practical considerations

Choice of spending functions

The choice of spending functions for the three boundaries should reflect regulatory and scientific considerations:

Efficacy: A conservative spending function such as Lan-DeMets O’Brien-Fleming (sfLDOF) is typical, spending very little $\alpha$ at early interim analyses when limited information is available.
Futility: Moderate spending (e.g., HSD with $\gamma = -2$) allows early stopping for futility when the treatment effect is clearly absent.
Harm: The Lan-DeMets Pocock (sfLDPocock) spending function provides more aggressive spending at early analyses, which is appropriate for harm monitoring since detecting a detrimental effect early is critical for patient safety.

Interpreting the harm bound

The harm bound is designed to be crossed when a small p-value favoring control is observed. In terms of the test statistic, a negative Z-value indicates that the hazard rate is higher in the experimental arm than in the control arm; that is, the experimental treatment appears to be worsening survival. When the Z-value falls below the harm bound, this constitutes a statistical signal that the treatment may be harmful, and the trial should be stopped with a recommendation to review the safety data.

The harm spending is computed under $H_0$ (no treatment effect), reflecting the probability of observing an apparent harmful effect by chance when there is actually no true effect. This controls the probability of a false harm signal.

Harm bound capping

In the implementation, the harm bound is automatically capped so it never exceeds the futility bound when both are active. This ensures the ordering harm bound $\leq$ futility bound $\leq$ efficacy bound at analyses where all three are tested, while allowing harm to be the active lower boundary when futility is skipped.
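
The capping rule described above can be illustrated with a few lines of base R. This is only a sketch of the stated behavior, not the gsDesign internals: the helper cap_harm is hypothetical, the bounds are made-up numbers, and a skipped futility analysis is represented here by NA as an assumption.

```r
# Illustrative capping rule (cap_harm is a hypothetical helper):
# where futility is tested, harm is limited to the futility bound;
# where futility is skipped (NA), harm is left unchanged.
cap_harm <- function(harm, futility) {
  ifelse(is.na(futility), harm, pmin(harm, futility))
}

# Made-up bounds: analysis 1 needs capping, analysis 2 skips futility,
# analysis 3 already satisfies harm <= futility
harm_raw <- c(-1.2, -1.8, -1.7)
futility <- c(-1.5, NA, 1.0)

harm_capped <- cap_harm(harm_raw, futility)
harm_capped
```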

When to use `test.type = 7` vs. `test.type = 8`

test.type = 8 (non-binding) is most often preferred in practice. Regulators will generally expect non-binding bounds, which preserve Type I error control regardless of whether the stopping rules are strictly followed. Since Data Monitoring Committees (DMCs) typically retain discretion to continue or stop a trial based on the totality of the evidence, the non-binding approach ensures that the statistical validity of the efficacy analysis is maintained even if a futility or harm boundary is crossed but the trial continues.
test.type = 7 (binding) is appropriate when there is a firm commitment to stop the trial upon crossing any boundary. This provides a small efficiency gain (slightly easier efficacy bounds and fewer required events) but requires strict protocol adherence. If the trial does not stop after crossing a binding boundary, Type I error may be inflated.

In most regulatory settings, test.type = 8 is the safer and more common choice.

Binding and non-binding harm monitoring

When futility and harm are both tested at an analysis, the harm boundary lies below the futility boundary and therefore does not add another stopping region. Crossing probabilities are nevertheless partitioned into mutually exclusive harm, futility, and efficacy outcomes.

When harm is tested at an analysis where futility is skipped, the harm boundary is the active lower stopping boundary. This stopping probability is included in sample-size derivation and, for test.type = 7, in the binding efficacy-bound calculation. For test.type = 8, efficacy bounds continue to use the non-binding Type I error convention, while power and expected information account for actual harm and futility stopping.

Adjusting the boundaries

The boundaries are adjustable through several design parameters:

Alternate astar: Controls the total spending under $H_0$ allocated to the harm bound, which limits the probability of a false OS harm signal.
Alternate spending functions: Different spending functions for efficacy, futility, and harm boundaries change the aggressiveness of each boundary across analyses.
Alternate timing of analyses: Changing the calendar times of interim analyses shifts the information available at each look.

Regardless of the statistical design, bounds must be clinically, ethically, and statistically sound. As previously noted, this approach is one option to address the regulatory expectation for OS harm monitoring, but other approaches may also be considered.

References

Proschan, Michael A., K. K. Gordon Lan, and Janet Turk Wittes. 2006. Statistical Monitoring of Clinical Trials: A Unified Approach. Springer.

U.S. Food and Drug Administration. 2024. Assessment of Overall Survival Evidence in Support of Accelerated Approval of Oncology Therapeutics: Draft Guidance for Industry. https://www.fda.gov/media/188274/download.
