When clinical trials include overall survival (OS) as a secondary or
exploratory endpoint, regulators may recommend monitoring not only for
early evidence of efficacy and futility, but also for potential
harm, that is, evidence that the experimental treatment may be
worsening survival relative to control. This article
demonstrates how the gsDesign package supports group
sequential designs with three boundaries: an efficacy
(upper) bound, a futility (lower) bound, and a
harm bound, using test.type = 7 (binding)
and test.type = 8 (non-binding).
The FDA draft guidance Assessment of Overall Survival Evidence in Support of Accelerated Approval of Oncology Therapeutics (U.S. Food and Drug Administration 2024) describes expectations for monitoring OS in the context of trials that may receive accelerated approval based on surrogate endpoints. The guidance states that sponsors should specify pre-planned boundaries for interim OS monitoring, including criteria for stopping a trial early if there is evidence of a detrimental effect on OS. Key points include:
This motivates the design framework with test.type = 7
(binding futility and harm bounds) and test.type = 8
(non-binding futility and harm bounds), where three boundaries are
simultaneously specified using spending functions.
The harm bound implemented in gsDesign is a new method that is easy to use — a principled, straightforward extension of the widely used group sequential spending function framework. While we believe this approach is understandable, useful, and flexible, other methods for monitoring potential harm may also be considered. However, there are limitations with this approach. The example presented here has higher mortality risk than many cases. With lower mortality risk, modifications of this approach or other approaches may be preferable.
In a standard two-sided asymmetric group sequential design
(test.type = 3 or 4), there are two
boundaries:
The harm bound extension (test.type = 7 or
8) adds a third boundary:
The harm bound lies below the futility bound. At each analysis, there are four possible outcomes:
The harm bound is designed to be crossed when the observed one-sided p-value in the direction favoring control is small. That is, the harm bound flags evidence that the experimental treatment may be worsening survival, i.e., a negative treatment effect on the log hazard ratio scale.
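As a small numerical illustration of the two p-value directions, the sketch below uses the first-interim harm-bound Z-value (-2.1121) reported in the boundary summary later in this article. It relies only on base R's pnorm(); the variable names are illustrative.

```r
# Harm-bound Z-value at the first interim analysis, copied from the
# boundary summary reported later in this article.
z_harm <- -2.1121

# One-sided p-value in the efficacy direction, P(Z >= z): this is the
# direction tabulated in the "p (1-sided)" rows of the summary.
p_efficacy <- pnorm(z_harm, lower.tail = FALSE)

# One-sided p-value in the direction favoring control, P(Z <= z).
p_control <- pnorm(z_harm)

round(c(p_efficacy = p_efficacy, p_control = p_control), 4)
```

A Z-value at the harm bound therefore has a large p-value in the efficacy direction and a small one in the direction favoring control; the latter also matches the first-analysis probability of crossing the harm bound under \(H_0\) in the summary.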
We demonstrate a survival design using gsSurvCalendar()
with test.type = 8 (non-binding futility and harm bounds).
The scenario is based on a 1:1 randomized trial monitoring overall
survival with:
The astar parameter controls the total spending for the
harm bound under \(H_0\). We set
astar = 0.1, meaning the total probability of crossing the
harm bound under \(H_0\) is 10%.
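The call that creates the design object x8 used throughout this section is not reproduced in the text. A minimal sketch follows, assuming it mirrors the binding-design call shown later in this article with only test.type changed to 8, and assuming an installed gsDesign version that supports harm bounds. The inline comments are interpretive.

```r
library(gsDesign)

# Sketch only: arguments are copied from the binding-design (x7) call shown
# later in this article, with test.type switched to 8 (non-binding).
x8 <- gsSurvCalendar(
  test.type = 8,                         # non-binding futility and harm bounds
  alpha = 0.0125,                        # one-sided Type I error
  beta = 0.1,                            # Type II error (90% power)
  astar = 0.1,                           # total harm-bound spending under H0
  calendarTime = c(12, 24, 36, 48, 60),  # analysis times in months
  sfu = sfLDOF,                          # efficacy spending function
  sfl = sfHSD, sflpar = -2,              # futility spending function
  sfharm = sfLDPocock,                   # harm spending function
  lambdaC = log(2) / 36,                 # control hazard (36-month median)
  hr = 0.75,                             # hazard ratio under H1
  R = 18,                                # enrollment duration in months
  minfup = 42                            # minimum follow-up in months
)
```

These settings agree with the design summary printed below (five analyses, 90 percent power, 1.25 percent one-sided Type I error, hazard ratio 0.75, 18 months of enrollment, 60 months total).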
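We specify:
- Efficacy: the Lan-DeMets O'Brien-Fleming (sfLDOF) spending function (conservative, spending little \(\alpha\) at early analyses).
- Futility: the Hwang-Shih-DeCani (sfHSD) spending function with gamma = -2.
- Harm: the Lan-DeMets Pocock (sfLDPocock) spending function (spending under \(H_0\) for detecting harm).

The summary() method provides a concise description of
the design: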
cat(strwrap(summary(x8), width = 65), sep = "\n")
#> Asymmetric two-sided group sequential design with non-binding
#> futility and harm bounds, 5 analyses, time-to-event outcome with
#> sample size 1148 and 657 events required, 90 percent power, 1.25
#> percent (1-sided) Type I error to detect a hazard ratio of 0.75.
#> Enrollment and total study durations are assumed to be 18 and 60
#> months, respectively. Efficacy bounds derived using a Lan-DeMets
#> O'Brien-Fleming approximation spending function (no parameters).
#> Futility bounds derived using a Hwang-Shih-DeCani spending
#> function with gamma = -2. Harm bounds derived using a Lan-DeMets
#> Pocock approximation spending function.
The gsBoundSummary() function produces a tabular summary
with columns for each boundary. By default, B-value,
Spending, CP, CP H1, and
PP are excluded. We note that for the first interim
analysis, the efficacy bound is so extreme it is effectively impossible
to cross. However, the harm and futility bounds are more moderate,
allowing for early stopping if there is evidence of harm or futility.
The futility bound is an indicator of why bounds are often non-binding —
the futility bound is not intended to be a strict stopping rule, but
rather a signal that the trial may be unlikely to succeed if it
continues. Crossing the harm bound is a stronger indication that the
treatment may be harmful, and the trial should be at least paused with a
recommendation to review the safety and other endpoint data.
gsBoundSummary(x8)
#> Method: LachinFoulkes
#> Analysis Value Harm Futility Efficacy
#> IA 1: 11% Z -2.1121 -1.4408 NA
#> N: 766 p (1-sided) 0.9827 0.9252 NA
#> Events: 73 ~HR at bound 1.6434 1.4034 NA
#> Month: 12 P(Cross) if HR=1 0.0173 0.0748 NA
#> P(Cross) if HR=0.75 0.0004 0.0039 NA
#> IA 2: 38% Z -1.7667 0.1212 3.8622
#> N: 1148 p (1-sided) 0.9614 0.4518 0.0001
#> Events: 253 ~HR at bound 1.2491 0.9849 0.6149
#> Month: 24 P(Cross) if HR=1 0.0507 0.5554 0.0001
#> P(Cross) if HR=0.75 0.0004 0.0181 0.0574
#> IA 3: 63% Z -1.7256 1.0566 2.9347
#> N: 1148 p (1-sided) 0.9578 0.1454 0.0017
#> Events: 416 ~HR at bound 1.1846 0.9015 0.7497
#> Month: 36 P(Cross) if HR=1 0.0736 0.8641 0.0017
#> P(Cross) if HR=0.75 0.0004 0.0398 0.4990
#> IA 4: 83% Z -1.7170 1.7357 2.5278
#> N: 1148 p (1-sided) 0.9570 0.0413 0.0057
#> Events: 548 ~HR at bound 1.1580 0.8622 0.8057
#> Month: 48 P(Cross) if HR=1 0.0890 0.9631 0.0062
#> P(Cross) if HR=0.75 0.0004 0.0675 0.7996
#> Final Z -1.7149 2.3072 2.3072
#> N: 1148 p (1-sided) 0.9568 0.0105 0.0105
#> Events: 657 ~HR at bound 1.1433 0.8352 0.8352
#> Month: 60 P(Cross) if HR=1 0.1000 0.9888 0.0112
#> P(Cross) if HR=0.75 0.0004 0.1000 0.9000
Conditional power (CP, CP H1) and predictive power (PP) can also be included in the summary. Below we show the full table with all statistics, including conditional and predictive power at each boundary:
gsBoundSummary(x8, exclude = c())
#> Method: LachinFoulkes
#> Analysis Value Harm Futility Efficacy
#> IA 1: 11% Z -2.1121 -1.4408 NA
#> N: 766 p (1-sided) 0.9827 0.9252 NA
#> Events: 73 ~HR at bound 1.6434 1.4034 NA
#> Month: 12 Spending 0.0173 0.0039 NA
#> B-value -0.7011 -0.4782 NA
#> CP 0.0000 0.0000 NA
#> CP H1 0.4619 0.5942 NA
#> PP 0.0011 0.0097 NA
#> P(Cross) if HR=1 0.0173 0.0748 NA
#> P(Cross) if HR=0.75 0.0004 0.0039 NA
#> IA 2: 38% Z -1.7667 0.1212 3.8622
#> N: 1148 p (1-sided) 0.9614 0.4518 0.0001
#> Events: 253 ~HR at bound 1.2491 0.9849 0.6149
#> Month: 24 Spending 0.0334 0.0143 0.0001
#> B-value -1.0954 0.0751 2.3947
#> CP 0.0000 0.0024 1.0000
#> CP H1 0.0097 0.4033 0.9994
#> PP 0.0000 0.0358 0.9994
#> P(Cross) if HR=1 0.0507 0.5554 0.0001
#> P(Cross) if HR=0.75 0.0004 0.0181 0.0574
#> IA 3: 63% Z -1.7256 1.0566 2.9347
#> N: 1148 p (1-sided) 0.9578 0.1454 0.0017
#> Events: 416 ~HR at bound 1.1846 0.9015 0.7497
#> Month: 36 Spending 0.0229 0.0217 0.0016
#> B-value -1.3725 0.8404 2.3343
#> CP 0.0000 0.0396 0.9928
#> CP H1 0.0000 0.3449 0.9928
#> PP 0.0000 0.0776 0.9759
#> P(Cross) if HR=1 0.0736 0.8641 0.0017
#> P(Cross) if HR=0.75 0.0004 0.0398 0.4990
#> IA 4: 83% Z -1.7170 1.7357 2.5278
#> N: 1148 p (1-sided) 0.9570 0.0413 0.0057
#> Events: 548 ~HR at bound 1.1580 0.8622 0.8057
#> Month: 48 Spending 0.0154 0.0277 0.0046
#> B-value -1.5689 1.5860 2.3098
#> CP 0.0000 0.1578 0.8708
#> CP H1 0.0000 0.3906 0.9337
#> PP 0.0000 0.1793 0.8485
#> P(Cross) if HR=1 0.0890 0.9631 0.0062
#> P(Cross) if HR=0.75 0.0004 0.0675 0.7996
#> Final Z -1.7149 2.3072 2.3072
#> N: 1148 p (1-sided) 0.9568 0.0105 0.0105
#> Events: 657 ~HR at bound 1.1433 0.8352 0.8352
#> Month: 60 Spending 0.0110 0.0325 0.0062
#> B-value -1.7149 2.3072 2.3072
#> P(Cross) if HR=1 0.1000 0.9888 0.0112
#> P(Cross) if HR=0.75 0.0004 0.1000 0.9000
The design has five analyses at calendar times of 12, 24, 36, 48, and 60 months. At each analysis, the test statistic (Z-value) is compared against three boundaries:
bounds <- data.frame(
Analysis = 1:x8$k,
Month = x8$T,
Events = ceiling(x8$n.I),
Harm = round(x8$harm$bound, 2),
Futility = round(x8$lower$bound, 2),
Efficacy = round(x8$upper$bound, 2)
)
kable(bounds, caption = "Z-value boundaries at each analysis")

| Analysis | Month | Events | Harm | Futility | Efficacy |
|---|---|---|---|---|---|
| 1 | 12 | 73 | -2.11 | -1.44 | 7.43 |
| 2 | 24 | 253 | -1.77 | 0.12 | 3.86 |
| 3 | 36 | 416 | -1.73 | 1.06 | 2.93 |
| 4 | 48 | 548 | -1.72 | 1.74 | 2.53 |
| 5 | 60 | 657 | -1.71 | 2.31 | 2.31 |
Decision rules at each analysis:
Note that the harm bound is always at or below the futility bound. At early analyses, the harm and futility bounds may coincide when the harm spending function has not yet allocated sufficient spending to differentiate them.
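The decision rules can be sketched as a small classifier over the rounded Z-value bounds in the table above. The helper classify_z() is hypothetical (it is not part of gsDesign), and the treatment of values exactly on a bound is an assumption made for illustration.

```r
# Z-value bounds rounded to two decimals, copied from the table above.
z_bounds <- data.frame(
  Analysis = 1:5,
  Harm     = c(-2.11, -1.77, -1.73, -1.72, -1.71),
  Futility = c(-1.44, 0.12, 1.06, 1.74, 2.31),
  Efficacy = c(7.43, 3.86, 2.93, 2.53, 2.31)
)

# Hypothetical helper: classify an observed Z-value z at analysis k.
# Ties are assigned to the efficacy and harm bounds by assumption.
classify_z <- function(z, k, b = z_bounds) {
  if (z >= b$Efficacy[k]) {
    "efficacy bound crossed"
  } else if (z <= b$Harm[k]) {
    "harm bound crossed"
  } else if (z < b$Futility[k]) {
    "futility bound crossed"
  } else {
    "continue"
  }
}

classify_z(-1.9, 2)  # below the harm bound at analysis 2
classify_z(0.5, 3)   # between the harm and futility bounds at analysis 3
classify_z(2.0, 3)   # between the futility and efficacy bounds at analysis 3
classify_z(3.0, 3)   # above the efficacy bound at analysis 3
```

At the final analysis the futility and efficacy bounds coincide, so the "continue" outcome cannot occur there.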
We examine the operating characteristics under two scenarios: no treatment effect (HR = 1, i.e., under \(H_0\)) and the design alternative (HR = 0.75).
probs <- data.frame(
Scenario = c(rep("Under H0 (HR=1)", x8$k), rep("Under H1 (HR=0.75)", x8$k)),
Analysis = rep(1:x8$k, 2),
Month = rep(x8$T, 2),
`P(Efficacy)` = c(cumsum(x8$upper$prob[, 1]), cumsum(x8$upper$prob[, 2])),
`P(Futility)` = c(cumsum(x8$lower$prob[, 1]), cumsum(x8$lower$prob[, 2])),
`P(Harm)` = c(cumsum(x8$harm$prob[, 1]), cumsum(x8$harm$prob[, 2])),
check.names = FALSE
)
kable(probs, digits = 4, caption = "Cumulative boundary crossing probabilities")

| Scenario | Analysis | Month | P(Efficacy) | P(Futility) | P(Harm) |
|---|---|---|---|---|---|
| Under H0 (HR=1) | 1 | 12 | 0.0000 | 0.0748 | 0.0173 |
| Under H0 (HR=1) | 2 | 24 | 0.0001 | 0.5554 | 0.0507 |
| Under H0 (HR=1) | 3 | 36 | 0.0017 | 0.8641 | 0.0736 |
| Under H0 (HR=1) | 4 | 48 | 0.0062 | 0.9631 | 0.0890 |
| Under H0 (HR=1) | 5 | 60 | 0.0112 | 0.9888 | 0.1000 |
| Under H1 (HR=0.75) | 1 | 12 | 0.0000 | 0.0039 | 0.0004 |
| Under H1 (HR=0.75) | 2 | 24 | 0.0574 | 0.0181 | 0.0004 |
| Under H1 (HR=0.75) | 3 | 36 | 0.4990 | 0.0398 | 0.0004 |
| Under H1 (HR=0.75) | 4 | 48 | 0.7996 | 0.0675 | 0.0004 |
| Under H1 (HR=0.75) | 5 | 60 | 0.9000 | 0.1000 | 0.0004 |
Under \(H_0\), the cumulative probability of crossing the harm bound across all analyses is approximately 0.1, reflecting the spending allocated to the harm boundary. Under \(H_1\) (HR = 0.75), crossing the harm bound is very unlikely (approximately \(4 \times 10^{-4}\)), since the treatment is beneficial.
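To see how the harm spending is distributed across analyses, the cumulative probabilities under \(H_0\) in the table above can be differenced. This is a base R sketch using the tabulated (rounded) values; the variable names are illustrative.

```r
# Cumulative harm-bound crossing probabilities under H0 (HR = 1),
# copied from the table above.
cum_harm_h0 <- c(0.0173, 0.0507, 0.0736, 0.0890, 0.1000)

# Probability of first crossing the harm bound at each analysis.
incr_harm_h0 <- diff(c(0, cum_harm_h0))

round(incr_harm_h0, 4)
```

The increments agree with the harm column of the Spending rows in the full gsBoundSummary() output shown earlier (0.0173, 0.0334, 0.0229, 0.0154, 0.0110).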
All standard plot() types are supported for
test.type = 7 and 8 designs, with a third line
(or set of lines) shown for the harm bound.
The default plot shows Z-value boundaries at each analysis. Three boundaries are displayed: efficacy (upper), futility (lower), and harm (below futility).
Z-value boundaries for non-binding harm bound design
The power plot (plottype = 2) shows cumulative boundary
crossing probabilities as a function of the treatment effect. Three sets
of lines appear: upper bound (cumulative efficacy crossing probability),
1-Futility bound, and 1-Harm bound. The harm lines are above the
futility lines because the probability of crossing the harm bound is
less than or equal to the probability of crossing the futility bound. We
note that when the underlying treatment effect favors control, the high
probability of crossing the harm bound indicates that the harm bound is
sensitive and serves its intended purpose.
Boundary crossing probabilities for non-binding harm bound design
The effect size plot (plottype = 3) shows the
approximate treatment effect at each boundary. For survival designs,
this is expressed as the approximate hazard ratio at the boundary.
Approximate treatment effect at boundaries
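The approximate hazard ratio at a bound can be roughly reproduced from the Z-value and the number of events. The sketch below assumes 1:1 randomization and the Schoenfeld-type approximation \(\mathrm{HR} \approx \exp(-2Z/\sqrt{d})\), where \(d\) is the event count; approx_hr() is a hypothetical helper, and it is an assumption that this mirrors the computation behind the tabulated values.

```r
# Harm-bound Z-values and rounded-up event counts, copied from the
# boundary summary for the non-binding design.
z_harm <- c(-2.1121, -1.7667, -1.7256, -1.7170, -1.7149)
events <- c(73, 253, 416, 548, 657)

# Hypothetical helper: Schoenfeld-type approximation for 1:1 randomization,
# with positive Z favoring the experimental arm.
approx_hr <- function(z, d) exp(-2 * z / sqrt(d))

hr_harm <- approx_hr(z_harm, events)
round(hr_harm, 3)
```

The results are close to the tabulated "~HR at bound" values for the harm bound (1.6434, 1.2491, 1.1846, 1.1580, 1.1433); small differences are expected because the event counts used here are rounded up.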
Conditional power (plottype = 4) at each interim
analysis is shown for all three boundaries. This is generally not a very
useful plot.
Conditional power at boundaries
The spending function plot (plottype = 5) shows the
three spending functions: \(\alpha\)
(efficacy), \(\beta\) (futility), and
harm.
Spending functions for non-binding harm bound design
B-values (plottype = 7) are Z-values scaled by \(\sqrt{t}\) where \(t\) is the information fraction. As
discussed by Proschan et al. (2006), the
expected value of B-values increases linearly with the information
fraction under the assumption of a constant treatment effect
(proportional hazards). This linear relationship makes B-values useful
for visual assessment of treatment effect trends across interim
analyses: departures from linearity may suggest non-proportional hazards
or other changes in treatment effect over time. Three boundary lines are
shown: efficacy, futility, and harm.
B-values at boundaries
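The scaling from Z-values to B-values can be checked numerically. This base R sketch uses the efficacy Z-values at analyses 2 through 5 from the boundary summary and approximates the information fraction by the ratio of rounded event counts, which is an assumption; the variable names are illustrative.

```r
# Efficacy-bound Z-values at analyses 2 to 5 and rounded-up event counts,
# copied from the boundary summary for the non-binding design.
z_eff  <- c(3.8622, 2.9347, 2.5278, 2.3072)
events <- c(253, 416, 548, 657)

# Approximate information fraction from event counts (assumption).
info_frac <- events / max(events)

# B-value: Z-value scaled by the square root of the information fraction.
b_eff <- z_eff * sqrt(info_frac)
round(b_eff, 3)
```

These values are close to the tabulated efficacy B-values (2.3947, 2.3343, 2.3098, 2.3072). On the B-value scale this bound varies much less across analyses than on the Z-value scale.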
For test.type = 7, both the futility and harm bounds are
binding — meaning the computation of the efficacy bound
assumes the trial will stop if either bound is crossed. This
yields a slightly less conservative efficacy bound (easier to cross),
but at the cost of inflated Type I error if the stopping rule is not
strictly followed.
We first create a binding design with \(\alpha = 0.0125\) to compare with the non-binding design above:
x7 <- gsSurvCalendar(
test.type = 7,
alpha = 0.0125,
beta = 0.1,
astar = 0.1,
calendarTime = c(12, 24, 36, 48, 60),
sfu = sfLDOF,
sfl = sfHSD, sflpar = -2,
sfharm = sfLDPocock,
lambdaC = log(2) / 36,
hr = 0.75,
R = 18,
minfup = 42
)
comparison <- data.frame(
Bound = c("Efficacy", "Futility", "Harm"),
`Binding (type 7)` = c(
paste(round(x7$upper$bound, 3), collapse = ", "),
paste(round(x7$lower$bound, 3), collapse = ", "),
paste(round(x7$harm$bound, 3), collapse = ", ")
),
`Non-binding (type 8)` = c(
paste(round(x8$upper$bound, 3), collapse = ", "),
paste(round(x8$lower$bound, 3), collapse = ", "),
paste(round(x8$harm$bound, 3), collapse = ", ")
),
check.names = FALSE
)
kable(comparison, caption = "Comparison of binding vs. non-binding Z-value boundaries")

| Bound | Binding (type 7) | Non-binding (type 8) |
|---|---|---|
| Efficacy | 7.434, 3.862, 2.934, 2.523, 2.248 | 7.434, 3.862, 2.935, 2.528, 2.307 |
| Futility | -1.458, 0.09, 1.016, 1.689, 2.248 | -1.441, 0.121, 1.057, 1.736, 2.307 |
| Harm | -2.112, -1.767, -1.726, -1.717, -1.715 | -2.112, -1.767, -1.726, -1.717, -1.715 |
Note that the efficacy bounds for test.type = 7
(binding) are slightly lower (easier to cross) than for
test.type = 8 (non-binding). The maximum number of events
for test.type = 7 (639) is also slightly smaller than for
test.type = 8 (657), reflecting the assumption that the
trial will stop at the lower bounds.
gsBoundSummary(x7)
#> Method: LachinFoulkes
#> Analysis Value Harm Futility Efficacy
#> IA 1: 11% Z -2.1121 -1.4578 NA
#> N: 746 p (1-sided) 0.9827 0.9275 NA
#> Events: 71 ~HR at bound 1.6550 1.4158 NA
#> Month: 12 P(Cross) if HR=1 0.0173 0.0725 NA
#> P(Cross) if HR=0.75 0.0005 0.0039 NA
#> IA 2: 38% Z -1.7667 0.0895 3.8622
#> N: 1118 p (1-sided) 0.9614 0.4643 0.0001
#> Events: 246 ~HR at bound 1.2531 0.9886 0.6107
#> Month: 24 P(Cross) if HR=1 0.0507 0.5430 0.0001
#> P(Cross) if HR=0.75 0.0005 0.0181 0.0539
#> IA 3: 63% Z -1.7256 1.0159 2.9344
#> N: 1118 p (1-sided) 0.9578 0.1548 0.0017
#> Events: 404 ~HR at bound 1.1874 0.9038 0.7467
#> Month: 36 P(Cross) if HR=1 0.0736 0.8551 0.0017
#> P(Cross) if HR=0.75 0.0005 0.0398 0.4829
#> IA 4: 83% Z -1.7170 1.6890 2.5229
#> N: 1118 p (1-sided) 0.9570 0.0456 0.0058
#> Events: 533 ~HR at bound 1.1604 0.8639 0.8037
#> Month: 48 P(Cross) if HR=1 0.0890 0.9592 0.0063
#> P(Cross) if HR=0.75 0.0005 0.0675 0.7881
#> Final Z -1.7149 2.2480 2.2480
#> N: 1118 p (1-sided) 0.9568 0.0123 0.0123
#> Events: 639 ~HR at bound 1.1454 0.8370 0.8370
#> Month: 60 P(Cross) if HR=1 0.1000 0.9875 0.0125
#> P(Cross) if HR=0.75 0.0005 0.1000 0.9000
The gsBoundSummary() function accepts an
alpha argument to display efficacy bounds at one or more
alternate \(\alpha\) levels alongside
the original design. Here we show the non-binding design
(x8) with efficacy bounds for both \(\alpha = 0.0125\) (the design level) and
\(\alpha = 0.025\):
gsBoundSummary(x8, alpha = 0.025)
#> Analysis Value α=0.0125 α=0.025 Futility Harm
#> IA 1: 11% Z NA 6.6513 -1.4408 -2.1121
#> N: 766 p (1-sided) NA 0.0000 0.9252 0.9827
#> Events: 73 ~HR at bound NA 0.2092 1.4034 1.6434
#> Month: 12 P(Cross) if HR=1 NA 0.0000 0.0748 0.0173
#> P(Cross) if HR=0.75 NA 0.0000 0.0039 0.0004
#> IA 2: 38% Z 3.8622 3.4312 0.1212 -1.7667
#> N: 1148 p (1-sided) 0.0001 0.0003 0.4518 0.9614
#> Events: 253 ~HR at bound 0.6149 0.6492 0.9849 1.2491
#> Month: 24 P(Cross) if HR=1 0.0001 0.0003 0.5554 0.0507
#> P(Cross) if HR=0.75 0.0574 0.1259 0.0181 0.0004
#> IA 3: 63% Z 2.9347 2.5948 1.0566 -1.7256
#> N: 1148 p (1-sided) 0.0017 0.0047 0.1454 0.9578
#> Events: 416 ~HR at bound 0.7497 0.7751 0.9015 1.1846
#> Month: 36 P(Cross) if HR=1 0.0017 0.0048 0.8641 0.0736
#> P(Cross) if HR=0.75 0.4990 0.6323 0.0398 0.0004
#> IA 4: 83% Z 2.5278 2.2359 1.7357 -1.7170
#> N: 1148 p (1-sided) 0.0057 0.0127 0.0413 0.9570
#> Events: 548 ~HR at bound 0.8057 0.8261 0.8622 1.1580
#> Month: 48 P(Cross) if HR=1 0.0062 0.0138 0.9631 0.0890
#> P(Cross) if HR=0.75 0.7996 0.8684 0.0675 0.0004
#> Final Z 2.3072 2.0432 2.3072 -1.7149
#> N: 1148 p (1-sided) 0.0105 0.0205 0.0105 0.9568
#> Events: 657 ~HR at bound 0.8352 0.8526 0.8352 1.1433
#> Month: 60 P(Cross) if HR=1 0.0112 0.0201 0.9888 0.1000
#> P(Cross) if HR=0.75 0.9000 0.9218 0.1000 0.0004
The choice of spending functions for the three boundaries should reflect regulatory and scientific considerations:
- Efficacy: the Lan-DeMets O'Brien-Fleming spending function (sfLDOF) is typical, spending very little \(\alpha\) at early interim analyses when limited information is available.
- Harm: the Lan-DeMets Pocock (sfLDPocock) spending function provides more aggressive spending at early analyses, which is appropriate for harm monitoring since detecting a detrimental effect early is critical for patient safety.

The harm bound is designed to be crossed when the observed one-sided p-value in the direction favoring control is small. In terms of the test statistic, a negative Z-value indicates that the hazard rate is higher in the experimental arm than in the control arm, i.e., the experimental treatment appears to be worsening survival. When the Z-value falls below the harm bound, this constitutes a statistical signal that the treatment may be harmful, and the trial should be stopped with a recommendation to review the safety data.
The harm spending is computed under \(H_0\) (no treatment effect), reflecting the probability of observing an apparent harmful effect by chance when there is actually no true effect. This controls the probability of a false harm signal.
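The cumulative harm spending under \(H_0\) can be approximately reproduced from the closed form of the Lan-DeMets Pocock-type spending function, \(\alpha(t) = \alpha \log(1 + (e - 1)t)\). The sketch assumes spending is driven by the information fraction, approximated here by the ratio of rounded event counts from the boundary summary. The helper ld_pocock_spend() is hypothetical and is not the gsDesign implementation.

```r
# Hypothetical helper: Lan-DeMets Pocock-type spending function,
# total * log(1 + (e - 1) * t), for information fraction t in [0, 1].
ld_pocock_spend <- function(total, t) total * log(1 + (exp(1) - 1) * t)

# Rounded-up event counts at the five analyses, from the boundary summary.
events <- c(73, 253, 416, 548, 657)

# Approximate information fraction from event counts (assumption).
info_frac <- events / max(events)

# Cumulative harm spending with astar = 0.1.
cum_spend <- ld_pocock_spend(0.1, info_frac)
round(cum_spend, 4)
```

The results are close to the cumulative probabilities of crossing the harm bound under HR = 1 reported in the summary (0.0173, 0.0507, 0.0736, 0.0890, 0.1000), with small differences attributable to rounding of the event counts.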
In the implementation, the harm bound is automatically capped so it never exceeds the futility bound. This ensures the ordering: harm bound \(\leq\) futility bound \(\leq\) efficacy bound at every analysis.
test.type = 7 vs. test.type = 8
- test.type = 8 (non-binding) is most often preferred in practice. Regulators will generally expect non-binding bounds, which preserve Type I error control regardless of whether the stopping rules are strictly followed. Since Data Monitoring Committees (DMCs) typically retain discretion to continue or stop a trial based on the totality of the evidence, the non-binding approach ensures that the statistical validity of the efficacy analysis is maintained even if a futility or harm boundary is crossed but the trial continues.
- test.type = 7 (binding) is appropriate when there is a firm commitment to stop the trial upon crossing any boundary. This provides a small efficiency gain (slightly easier efficacy bounds and fewer required events) but requires strict protocol adherence. If the trial does not stop after crossing a binding boundary, Type I error may be inflated.

In most regulatory settings, test.type = 8 is the safer
and more common choice.
One might consider a design where the futility bound is non-binding
but the harm bound is binding. In practice, such a distinction has no
computational effect. The harm bound is computed after the
efficacy and futility bounds are set and does not feed back into those
computations. When the futility bound is non-binding (as in
test.type = 8), the efficacy bound is computed ignoring all
lower-bound stopping. Since the harm bound lies below the futility
bound, making the harm bound “binding” while the futility bound remains
non-binding would not change the efficacy boundary, the required number
of events, the final Z-values, or the p-values — the results are
identical.
The only difference would be in interpretation: whether
crossing the harm bound is treated as a firm commitment to stop or as
advisory information for the DMC. This interpretive distinction does not
require a separate test.type; it can be addressed in the
protocol language and the DMC charter. The test.type = 8
framework already provides full flexibility for the DMC to treat the
harm bound as either advisory or mandatory.
The boundaries are adjustable through several design parameters:
- astar: Controls the Type I error allocated to excess OS harm detection.

Regardless of the statistical design, bounds must be clinically, ethically, and statistically sound. As previously noted, this approach is one option to address the regulatory expectation for OS harm monitoring, but other approaches may also be considered.