--- title: "Score vs Wald tests and sample-size recommendations" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Score vs Wald tests and sample-size recommendations} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r, message=FALSE, warning=FALSE} library(gsDesignNB) library(data.table) library(ggplot2) library(DT) has_plotly <- requireNamespace("plotly", quietly = TRUE) ``` ## Introduction This vignette compares the Wald and score test implementations in `mutze_test()` across a factorial grid of negative binomial trial scenarios. It also gives practical recommendations for sample-size calculation when the usual Zhu--Lakkis / Friede--Schmidli / Mutze Wald formula is adequate, when score-test sizing is a useful diagnostic, and when the score test itself is the more important change for Type I error control. The **Wald sizing** option in `sample_size_nbinom(test_type = "wald")` uses the alternative variance $V_1$ for both the Type I and power components. The **score sizing** option in `sample_size_nbinom(test_type = "score")` uses the null variance $V_0$ for the Type I component and the alternative variance $V_1$ for the power component: $$ n_1 = \frac{(z_{\alpha/s}\sqrt{V_0} + z_\beta\sqrt{V_1})^2} {(\theta - \theta_0)^2}. $$ This distinction matters most when the planned final analysis uses a score statistic evaluated under the null, or when finite-sample Type I error control is more important than preserving the historical Wald analysis convention. In the superiority scenarios below, the Wald and score sample sizes are close; the traditional Wald sample size paired with the score test often provides a useful practical margin for power while preserving the score test's Type I error protection. The full $2 \times 2$ factorial comparison is: | | Wald-sized trial | Score-sized trial | |---|---|---| | **Wald test** | Wald / Wald | Score / Wald | | **Score test** | Wald / Score | Score / Score | We assess: - **Type I error control** under $H_0: RR = 1$ - **Power** under $H_1: RR = 0.70$ - **Z-score distributions** to check asymptotic normality All tables are interactive: use the column filters to narrow results. Figures are interactive when the optional `plotly` package is installed; otherwise they render as static `ggplot2` figures. Results are pre-computed by `data-raw/generate_score_sweep.R` and loaded here. ## Load pre-computed results ```{r} results_file <- system.file("extdata", "score_sweep_results.rds", package = "gsDesignNB") if (results_file == "" && file.exists("../inst/extdata/score_sweep_results.rds")) { results_file <- "../inst/extdata/score_sweep_results.rds" } res <- readRDS(results_file) config <- res$config scenarios <- as.data.table(res$scenarios) base_grid <- as.data.table(res$base_grid) ``` ```{r} cat(sprintf( "Expanded scenarios: %d | Power sims: %s | Null sims: %s | RR: %.2f | alpha: %.3f\n", nrow(scenarios), format(config$n_sims_power, big.mark = ","), format(config$n_sims_null, big.mark = ","), config$rr_power, config$alpha )) ``` ### Scenario grid The base scenario grid varies control event rate ($\lambda_1$), overdispersion ($k$), and minimum inter-event gap. For each base scenario, sample sizes are computed using both the Wald and score variance formulas. In this superiority grid the score-sized trials are equal to or slightly smaller than the Wald-sized trials; score sizing is therefore not a generic "add a few subjects" rule, and the operating characteristics still need to be checked under the planned analysis test. ```{r} base_display <- base_grid[, .( `Control rate` = lambda1, `Dispersion (k)` = k, `Event gap (days)` = gap_days, `N (Wald sizing)` = n_wald, `N (Score sizing)` = n_score, `Wald - Score` = n_wald - n_score )] datatable( base_display, caption = "Base scenario grid with sample sizes by method", filter = "top", options = list(pageLength = 15, dom = "ftip"), rownames = FALSE ) |> formatRound(columns = "Control rate", digits = 2) ``` ## Type I error comparison ```{r} null_dt <- as.data.table(res$null_summary) null_long <- melt( null_dt, id.vars = c("lambda1", "k", "gap_days", "n_total", "sizing"), measure.vars = c("rate_wald", "rate_score"), variable.name = "test", value.name = "rejection_rate" ) null_long[, test := fifelse(test == "rate_wald", "Wald", "Score")] se_long <- melt( null_dt, id.vars = c("lambda1", "k", "gap_days", "sizing"), measure.vars = c("se_wald", "se_score"), variable.name = "test", value.name = "se" ) se_long[, test := fifelse(test == "se_wald", "Wald", "Score")] null_long <- merge(null_long, se_long, by = c("lambda1", "k", "gap_days", "sizing", "test")) null_long[, combo := paste0(sizing, "-sized / ", test, " test")] null_long[, `:=`( above_nominal_95 = rejection_rate - 1.96 * se > config$alpha, below_nominal_95 = rejection_rate + 1.96 * se < config$alpha )] ``` ```{r} type1_summary <- null_long[, .( `Scenarios` = .N, `Minimum` = min(rejection_rate), `Mean` = mean(rejection_rate), `Maximum` = max(rejection_rate), `Above nominal beyond MC error` = sum(above_nominal_95), `Below nominal beyond MC error` = sum(below_nominal_95) ), by = .(`Sizing` = sizing, `Test` = test)] datatable( type1_summary[order(Sizing, Test)], caption = htmltools::tags$caption( style = "caption-side: top; text-align: left; font-weight: bold;", "Type I error synopsis across the scenario grid" ), options = list(pageLength = 8, dom = "tip"), rownames = FALSE ) |> formatRound(columns = c("Minimum", "Mean", "Maximum"), digits = 4) ``` ```{r} null_display <- null_long[order(lambda1, k, gap_days, sizing, test), .( `Control rate` = lambda1, Dispersion = k, `Gap (days)` = gap_days, Sizing = sizing, N = n_total, Test = test, `Type I error` = round(rejection_rate, 4), SE = round(se, 4) ) ] datatable( null_display, caption = htmltools::tags$caption( style = "caption-side: top; text-align: left; font-weight: bold;", sprintf("Type I error rate — nominal alpha = %.3f, %s null sims/scenario", config$alpha, format(config$n_sims_null, big.mark = ",")) ), filter = "top", options = list(pageLength = 20, dom = "ftip"), rownames = FALSE ) ``` ```{r, fig.width=8, fig.height=6} null_long[, scenario := paste0("λ₁=", lambda1, " k=", k)] p_null <- ggplot(null_long, aes(x = scenario, y = rejection_rate, color = combo, shape = test, text = paste0("Sizing: ", sizing, "\nTest: ", test, "\nRate: ", round(rejection_rate, 4), "\nN: ", n_total))) + geom_point(size = 2.5, position = position_dodge(width = 0.5)) + geom_errorbar(aes(ymin = rejection_rate - 1.96 * se, ymax = rejection_rate + 1.96 * se), width = 0.2, position = position_dodge(width = 0.5)) + geom_hline(yintercept = config$alpha, linetype = "dashed", color = "grey40") + facet_wrap(~ paste0("Gap = ", gap_days, "d")) + labs( title = "Type I error: sizing method × test type", x = NULL, y = "Rejection rate", color = "Sizing / Test", shape = "Test" ) + theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) if (has_plotly) { plotly::ggplotly(p_null, tooltip = "text") |> plotly::layout(legend = list(orientation = "v", x = 1.02, y = 0.5)) } else { p_null } ``` ## Power comparison ```{r} power_dt <- as.data.table(res$power_summary) power_long <- melt( power_dt, id.vars = c("lambda1", "k", "gap_days", "n_total", "sizing"), measure.vars = c("rate_wald", "rate_score"), variable.name = "test", value.name = "power" ) power_long[, test := fifelse(test == "rate_wald", "Wald", "Score")] se_power <- melt( power_dt, id.vars = c("lambda1", "k", "gap_days", "sizing"), measure.vars = c("se_wald", "se_score"), variable.name = "test", value.name = "se" ) se_power[, test := fifelse(test == "se_wald", "Wald", "Score")] power_long <- merge(power_long, se_power, by = c("lambda1", "k", "gap_days", "sizing", "test")) power_long[, combo := paste0(sizing, "-sized / ", test, " test")] ``` ```{r} power_summary <- power_long[, .( `Scenarios` = .N, `Minimum` = min(power), `Mean` = mean(power), `Maximum` = max(power), `Below 90%` = sum(power < config$power_target) ), by = .(`Sizing` = sizing, `Test` = test)] datatable( power_summary[order(Sizing, Test)], caption = htmltools::tags$caption( style = "caption-side: top; text-align: left; font-weight: bold;", "Power synopsis across the scenario grid" ), options = list(pageLength = 8, dom = "tip"), rownames = FALSE ) |> formatRound(columns = c("Minimum", "Mean", "Maximum"), digits = 4) ``` ```{r} power_display <- power_long[order(lambda1, k, gap_days, sizing, test), .( `Control rate` = lambda1, Dispersion = k, `Gap (days)` = gap_days, Sizing = sizing, N = n_total, Test = test, Power = round(power, 4), SE = round(se, 4) ) ] datatable( power_display, caption = htmltools::tags$caption( style = "caption-side: top; text-align: left; font-weight: bold;", sprintf("Power — RR = %.2f, %s power sims/scenario", config$rr_power, format(config$n_sims_power, big.mark = ",")) ), filter = "top", options = list(pageLength = 20, dom = "ftip"), rownames = FALSE ) ``` ```{r, fig.width=8, fig.height=6} power_long[, scenario := paste0("λ₁=", lambda1, " k=", k)] p_power <- ggplot(power_long, aes(x = scenario, y = power, color = combo, shape = test, text = paste0("Sizing: ", sizing, "\nTest: ", test, "\nPower: ", round(power, 4), "\nN: ", n_total))) + geom_point(size = 2.5, position = position_dodge(width = 0.5)) + geom_errorbar(aes(ymin = power - 1.96 * se, ymax = power + 1.96 * se), width = 0.2, position = position_dodge(width = 0.5)) + geom_hline(yintercept = config$power_target, linetype = "dashed", color = "grey40") + facet_wrap(~ paste0("Gap = ", gap_days, "d")) + labs( title = "Power: sizing method × test type", subtitle = sprintf("Target = %.0f%%, RR = %.2f", 100 * config$power_target, config$rr_power), x = NULL, y = "Power", color = "Sizing / Test", shape = "Test" ) + theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) if (has_plotly) { plotly::ggplotly(p_power, tooltip = "text") |> plotly::layout(legend = list(orientation = "v", x = 1.02, y = 0.5)) } else { p_power } ``` ## Z-score density comparison (null simulations) Under $H_0$, the Z-statistics should follow $N(0, 1)$ if the asymptotic approximation holds. ```{r, fig.width=9, fig.height=7} z_null <- as.data.table(res$z_sample_null) sc_info <- data.table( scenario_id = seq_len(nrow(scenarios)), scenarios[, .(lambda1, k, gap_days, sizing)] ) z_null <- merge(z_null, sc_info, by = "scenario_id") z_null[, label := sprintf("l1=%.2f k=%.1f gap=%dd (%s)", lambda1, k, gap_days, sizing)] z_null_long <- melt( z_null, id.vars = c("scenario_id", "label", "sizing"), measure.vars = c("z_wald", "z_score"), variable.name = "test", value.name = "z" ) z_null_long[, test := fifelse(test == "z_wald", "Wald", "Score")] z_null_long <- z_null_long[is.finite(z)] p_z <- ggplot(z_null_long, aes(x = z, color = test)) + geom_density(linewidth = 0.6) + stat_function(fun = dnorm, color = "black", linetype = "dashed", linewidth = 0.4) + facet_wrap(~ label, scales = "free_y") + labs( title = "Null Z-score densities: Wald vs Score", subtitle = "Dashed line = N(0,1) reference", x = "Z-statistic", y = "Density", color = "Test" ) + theme_minimal() + coord_cartesian(xlim = c(-4, 4)) if (has_plotly) { plotly::ggplotly(p_z) |> plotly::layout(legend = list(orientation = "h", x = 0.3, y = -0.05)) } else { p_z } ``` ## Fallback method frequency When the negative binomial MLE fails to converge or yields non-overdispersed estimates, `mutze_test()` falls back to Poisson or method-of-moments estimation. ```{r} null_fb <- as.data.table(res$null_summary) fb_display <- null_fb[, .( `Control rate` = lambda1, Dispersion = k, `Gap (days)` = gap_days, Sizing = sizing, `Poisson (Wald)` = round(pct_fallback_poisson_wald, 1), `MoM (Wald)` = round(pct_fallback_mom_wald, 1), `Poisson (Score)` = round(pct_fallback_poisson_score, 1), `MoM (Score)` = round(pct_fallback_mom_score, 1) )] datatable( fb_display, caption = "Fallback method frequency (%, null sims)", filter = "top", options = list(pageLength = 20, dom = "ftip"), rownames = FALSE ) ``` ## Summary - The choice of final analysis test matters more than the choice of Wald versus score sizing in this grid. Wald and score sample sizes differ by at most four subjects, while the null rejection pattern changes materially when the test statistic changes. - The Wald test is mildly anti-conservative in several finite-sample scenarios. Across the grid, its empirical Type I error ranges from `r round(min(null_long[test == "Wald"]$rejection_rate), 4)` to `r round(max(null_long[test == "Wald"]$rejection_rate), 4)`, and many Wald cells are above nominal beyond Monte Carlo error. - The score test provides better Type I protection. Its empirical Type I error ranges from `r round(min(null_long[test == "Score"]$rejection_rate), 4)` to `r round(max(null_long[test == "Score"]$rejection_rate), 4)`; no score-test cell is above nominal beyond Monte Carlo error, although several are conservatively below nominal. - Use the Zhu--Lakkis / Friede--Schmidli / Mutze Wald formula when the planned primary analysis is the Wald log-rate-ratio test. It is also a reasonable practical baseline when the planned primary analysis is the score test, because in this superiority grid Wald sizing gives the score test a small sample-size margin. - Move to the score-test workflow when Type I error preservation is the primary concern, especially for lower-information designs, high dispersion, event gaps, non-inferiority or super-superiority margins, or adaptive/group sequential settings. In that workflow, analyze and simulate with `mutze_test(test_type = "score")`, `sim_gs_nbinom(test_type = "score")`, or `sim_ssr_nbinom(test_type = "score")`; compare Wald and score sizing rather than assuming the two-variance score formula will automatically deliver nominal power. - Score-test power is slightly conservative in this grid. If the score test is the planned primary analysis, verify power by simulation and consider retaining Wald sizing, increasing the target power, or adding a modest information margin before finalizing the protocol.