---
title: "Multiple imputation for longitudinal negative binomial counts"
output: rmarkdown::html_vignette
bibliography: gsDesignNB.bib
vignette: >
  %\VignetteIndexEntry{Multiple imputation for longitudinal negative binomial counts}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>",
  fig.width  = 7,
  fig.height = 4
)
```

## Introduction

Longitudinal clinical trials with recurrent-event count endpoints (e.g., MRI
lesion counts in multiple sclerosis, exacerbation counts in COPD) routinely
encounter missing data due to intercurrent events (ICEs) such as premature
treatment discontinuation, adverse events requiring rescue therapy, or death.
Proper handling of these missing values requires a principled strategy that
reflects the scientific question—the *estimand*—as defined by the ICH E9(R1)
addendum.

The `gsDesignNB` package provides three imputation strategies for longitudinal
negative binomial count data:

| Mechanism | Flag value | Strategy |
|:----------|:-----------|:---------|
| Missing at Random | `"MAR"` | GLMM-based imputation with subject BLUPs |
| Missing Not at Random (non-reference arm) | `"MNAR"` | Reference-based (copy-reference) imputation |
| Intercurrent event — composite | `"Comp"` | Baseline count carried forward |

These mirror the SAS implementation using `PROC GLIMMIX` with
`dist=negbin link=log` and `PROC PLM` for counterfactual predictions.

## Statistical background

### Negative binomial count model

For subject $i$ at visit $t$, let $Y_{it}$ denote the count endpoint.
The negative binomial (NB) model used throughout **gsDesignNB** is the
Gamma–Poisson mixture:

$$
\Lambda_i \sim \text{Gamma}\!\left(\frac{1}{k},\; \mu_i k\right),
\qquad
Y_{it} \mid \Lambda_i \sim \text{Poisson}(\Lambda_i).
$$

Marginally, $Y_{it}$ has mean $\mu_i$ and variance
$\text{Var}(Y_{it}) = \mu_i + k\,\mu_i^2$, where $k \geq 0$ is the
dispersion parameter. In the longitudinal GLMM, the mean is modelled as

$$
\log(\mu_{it}) = \mathbf{x}_{it}^{\top}\boldsymbol{\beta} + b_i,
$$

where $\mathbf{x}_{it}$ contains fixed-effect covariates (treatment, visit,
baseline, stratification factors) and $b_i$ is a subject-level random
intercept. The model is fitted via [glmmTMB::glmmTMB()] with
`family = nbinom2(link = "log")`.

### Imputation draw

All three model-based strategies draw imputed counts from the same
Gamma–Poisson compound:

$$
\lambda_i^{(m)} \sim \text{Gamma}\!\left(\frac{1}{k},\; \hat\mu_i \cdot k\right),
\qquad
Y_i^{(m)} \mid \lambda_i^{(m)} \sim \text{Poisson}\!\left(\lambda_i^{(m)}\right).
$$

The strategies differ only in how $\hat\mu_i$ is determined:

- **MAR**: $\hat\mu_i = \hat\mu_i^{\text{BLUP}}$ — predicted mean including
  subject random effects.
- **MNAR reference-based**: $\hat\mu_i = \hat\mu_i^{\text{FE, ref}} \times
  \dfrac{\hat\mu_i^{\text{BLUP}}}{\hat\mu_i^{\text{FE}}}$ — counterfactual
  mean under the reference treatment, adjusted by each subject's estimated
  random effect. The ratio $\hat\mu_i^{\text{BLUP}} / \hat\mu_i^{\text{FE}}$
  is the multiplicative random-effect "BLUP multiplier" computed from the
  original treatment assignment.
- **Composite**: $Y_i^{(m)} = y_{i,0}$ (baseline value), no model draw.

### Bootstrap–MI combination

When `n_boot > 1`, subjects are resampled with replacement within strata before
each GLMM fit. This *boot-MI* approach propagates both imputation uncertainty
and model-fitting uncertainty into the final variance estimate without
requiring Rubin's combining rules. When `n_boot = 1`, standard MI is performed
and Rubin's rules should be applied when pooling estimates across the `n_imp`
imputed datasets.

## Worked example

### Simulating longitudinal count data

We simulate a 3-arm trial (placebo and two active doses) with 4 post-baseline
visits, negative binomial counts (dispersion $k = 0.5$), and a single
binary stratification factor.

```{r simulate}
set.seed(2025)

n_subj   <- 90L   # 30 per arm
n_visit  <- 4L
k_true   <- 0.5   # dispersion
lambda   <- c(placebo = 1.8, low = 1.1, high = 0.7)   # mean counts/visit

trt_labels <- rep(c("placebo", "low", "high"), each = n_subj / 3L)
strat      <- sample(c(0L, 1L), n_subj, replace = TRUE)
baseline   <- rnbinom(n_subj, mu = 2.5, size = 1 / k_true)
id         <- seq_len(n_subj)

# Build long-format data (all visits complete at first)
long_data <- do.call(rbind, lapply(seq_len(n_subj), function(i) {
  mu_i <- lambda[trt_labels[i]] * exp(0.15 * strat[i] + 0.05 * baseline[i])
  data.frame(
    id       = i,
    visit    = seq_len(n_visit),
    trt      = trt_labels[i],
    strat1   = strat[i],
    baseline = baseline[i],
    count    = rnbinom(n_visit, mu = mu_i, size = 1 / k_true),
    stringsAsFactors = FALSE
  )
}))

# Sort
long_data <- long_data[order(long_data$id, long_data$visit), ]
rownames(long_data) <- NULL
str(long_data)
```

### Introducing missing data by mechanism

We introduce three types of missing data to mirror a realistic trial:

- **MAR**: A random subset of subjects miss visits 3–4 due to non-disease
  reasons (e.g., logistical dropout). The probability of dropout is unrelated
  to unobserved counts given the observed history — a plausible MAR assumption.
- **MNAR**: A subset of active-arm subjects discontinue at visit 2 due to
  adverse events likely linked to drug exposure. Their later counts are
  expected to be higher (on the reference arm trajectory) than for subjects
  who remained on treatment — a MNAR mechanism for which reference-based
  imputation is appropriate.
- **Composite ICE**: A small number of subjects in any arm experience a
  severe disease worsening event (e.g., hospitalisation) that leads to
  treatment discontinuation. The composite estimand treats this event as part
  of the outcome; missing post-event counts are filled with the baseline count.

```{r introduce_missing}
long_data$miss_flag <- NA_character_

# --- MAR: ~15% of subjects drop out from visit 3 onward ---
mar_subjects <- sample(id, size = round(0.15 * n_subj))
long_data$miss_flag[
  long_data$id %in% mar_subjects & long_data$visit >= 3
] <- "MAR"

# --- MNAR: ~10% of active-arm subjects withdraw at visit 2 ---
active_ids  <- id[trt_labels != "placebo"]
mnar_subjects <- sample(active_ids, size = round(0.10 * n_subj))
long_data$miss_flag[
  long_data$id %in% mnar_subjects & long_data$visit >= 2
] <- "MNAR"

# --- Composite ICE: ~5% of all subjects, post-event visits ---
comp_subjects <- sample(setdiff(id, c(mar_subjects, mnar_subjects)),
                        size = round(0.05 * n_subj))
comp_ice_visit <- sample(2L:4L, length(comp_subjects), replace = TRUE)
for (ci in seq_along(comp_subjects)) {
  long_data$miss_flag[
    long_data$id == comp_subjects[ci] &
    long_data$visit >= comp_ice_visit[ci]
  ] <- "Comp"
}

# Set the outcome to NA for flagged rows (simulate non-collection)
long_data$count[!is.na(long_data$miss_flag)] <- NA_integer_

cat("Missing rows by mechanism:\n")
print(table(long_data$miss_flag, useNA = "ifany"))
```

### Running `impute_nb()`

We now run the full MI pipeline. For this illustration we use `n_boot = 1`
(no bootstrap) and `n_imp = 5` imputations, fitting a random-intercept NB
GLMM.

```{r run_mi, eval=requireNamespace("glmmTMB", quietly = TRUE)}
library(gsDesignNB)

# The formula mirrors the fixed-effects structure from the SAS code.
# A random intercept per subject captures between-subject heterogeneity.
mi_formula <- count ~ baseline + strat1 + trt + visit + (1 | id)

imp_result <- impute_nb(
  data            = long_data,
  formula         = mi_formula,
  outcome_col     = "count",
  miss_flag_col   = "miss_flag",
  baseline_col    = "baseline",
  trt_col         = "trt",
  reference_trt   = "placebo",
  subject_col     = "id",
  strata_cols     = c("trt", "strat1"),
  mar_values      = "MAR",
  mnar_value      = "MNAR",
  composite_value = "Comp",
  n_imp           = 5L,
  n_boot          = 1L,
  seed            = 42L
)

# Dimensions: n_subj * n_visit * n_imp rows
dim(imp_result)
names(imp_result)
```

```{r run_mi_fallback, echo=FALSE, eval=!requireNamespace("glmmTMB", quietly = TRUE)}
message(
  "NOTE: 'glmmTMB' is not installed in this build environment.\n",
  "Install it with install.packages('glmmTMB') to run this vignette."
)
```

### Inspecting imputed values

The `imputed_value` column contains:

- The **observed value** for non-missing rows.
- An **imputed draw** for missing rows.

```{r inspect, eval=requireNamespace("glmmTMB", quietly = TRUE)}
# Show imputed rows only, first 2 imputations
imp_missing <- imp_result[
  !is.na(imp_result$miss_flag) & imp_result$imputation <= 2,
  c("id", "visit", "trt", "miss_flag", "baseline", "count", "imputation",
    "imputed_value")
]
head(imp_missing[order(imp_missing$miss_flag, imp_missing$id,
                        imp_missing$imputation), ], 20)
```

### Comparing imputed means by strategy

We can verify that the imputation behaved as expected. Under MAR, imputed
values should centre around the model prediction for each arm. Under
reference-based MNAR, imputed values for non-placebo subjects should look
more like placebo outcomes (higher counts).

```{r means_table, eval=requireNamespace("glmmTMB", quietly = TRUE)}
# Mean imputed value per mechanism and treatment arm
agg <- aggregate(
  imputed_value ~ miss_flag + trt,
  data   = imp_result[!is.na(imp_result$miss_flag), ],
  FUN    = mean
)
agg <- agg[order(agg$miss_flag, agg$trt), ]
print(agg, row.names = FALSE)
```

The MNAR reference-based imputed means for the `low` and `high` arms should
be closer to the `placebo` mean than the MAR-imputed means, reflecting the
"what if this subject had been on placebo?" counterfactual.

### Pooled analysis with Rubin's rules (standard MI)

After standard MI (`n_boot = 1`), pool estimates across the `n_imp` imputed
datasets using Rubin's rules. Here we compute the mean total count per arm at
visit 4 as a simple illustration.

```{r pooling, eval=requireNamespace("glmmTMB", quietly = TRUE)}
# Per-imputation mean at visit 4
v4 <- imp_result[imp_result$visit == 4, ]
per_imp <- lapply(split(v4, v4$imputation), function(d) {
  tapply(d$imputed_value, d$trt, mean, na.rm = TRUE)
})

# Point estimate: average over imputations
Q_bar <- Reduce("+", per_imp) / length(per_imp)

# Within-imputation variance (using single-imputation SE ~ mean/n, illustrative)
# In practice use regression SE from each imputed dataset
cat("Pooled mean imputed count at visit 4 by treatment arm:\n")
print(round(Q_bar, 2))
```

### Bootstrap–MI for variance estimation

When `n_boot > 1`, no Rubin's rules are needed. The variance across all
`n_boot * n_imp` completed datasets directly incorporates both sources of
uncertainty.

```{r boot_mi, eval=requireNamespace("glmmTMB", quietly = TRUE)}
imp_boot <- impute_nb(
  data            = long_data,
  formula         = mi_formula,
  outcome_col     = "count",
  miss_flag_col   = "miss_flag",
  baseline_col    = "baseline",
  trt_col         = "trt",
  reference_trt   = "placebo",
  subject_col     = "id",
  strata_cols     = c("trt", "strat1"),
  n_imp           = 5L,
  n_boot          = 3L,   # use larger values in practice (e.g., 100–500)
  seed            = 99L
)

# Each replicate × imputation combination is an independent completed dataset
cat("Total rows:", nrow(imp_boot),
    "= n_subj * n_visit * n_boot * n_imp =",
    n_subj * n_visit * 3L * 5L, "\n")

# Variance of the arm-level mean at visit 4 across all completed datasets
v4b <- imp_boot[imp_boot$visit == 4, ]
dataset_means <- tapply(
  v4b$imputed_value,
  list(paste(v4b$replicate, v4b$imputation, sep = "_"), v4b$trt),
  mean
)
cat("\nVariance of dataset-level means across boot×imp datasets:\n")
print(round(apply(dataset_means, 2, var), 4))
```

### Using the building-block functions directly

Advanced users can call the individual strategy functions for more control.

#### Step 1 — Fit the GLMM

```{r step1, eval=requireNamespace("glmmTMB", quietly = TRUE)}
obs_data <- long_data[!is.na(long_data$count), ]
fits <- fit_nb_glmm(
  data    = obs_data,
  formula = count ~ baseline + strat1 + trt + visit + (1 | id)
)
cat("Estimated dispersion k:", round(fits[["1"]]$k, 3), "\n")
```

#### Step 2 — MAR imputation

```{r step2, eval=requireNamespace("glmmTMB", quietly = TRUE)}
imp_mar <- impute_nb_mar(
  data          = long_data,
  fits          = fits,
  outcome_col   = "count",
  miss_flag_col = "miss_flag",
  mar_values    = c("MAR", "MNAR"),  # reference-arm MNAR treated as MAR
  n_imp         = 3L
)
cat("MAR imputed rows:", sum(!is.na(imp_mar$imputed_value[
  imp_mar$imputation == 1 & !is.na(imp_mar$miss_flag)
])), "\n")
```

#### Step 3 — MNAR reference-based imputation

```{r step3, eval=requireNamespace("glmmTMB", quietly = TRUE)}
imp_mnar <- impute_nb_mnar_ref(
  data          = long_data,
  fits          = fits,
  outcome_col   = "count",
  miss_flag_col = "miss_flag",
  mnar_value    = "MNAR",
  trt_col       = "trt",
  reference_trt = "placebo",
  n_imp         = 3L
)
```

#### Step 4 — Composite strategy (no model required)

```{r step4}
library(gsDesignNB)

# Composite fill can be applied to any data frame, no glmmTMB needed
df_comp_example <- data.frame(
  count         = c(3L,   NA,     NA,      5L),
  imputed_value = c(3L,    7L,    NA,      5L),
  miss_flag     = c(NA,  "MAR", "Comp",    NA),
  baseline      = c(4L,    4L,    4L,      6L)
)
impute_nb_composite(
  df_comp_example,
  outcome_col     = "count",
  miss_flag_col   = "miss_flag",
  composite_value = "Comp",
  baseline_col    = "baseline"
)
```

The third row (`miss_flag = "Comp"`) now has `imputed_value = 4` (the
baseline value); the second row (`miss_flag = "MAR"`) retains its
previously imputed value of 7.

## Key considerations

**Choice of random-effect structure.** The original SAS model uses an
unstructured residual covariance across visits within each
subject × endpoint combination. In `glmmTMB` this can be approximated with
`(0 + visit_factor | id:param)`, but convergence may be slow for small
datasets. Start with a random intercept `(1 | id)` and add complexity
incrementally.

**Treatment variable type.** If `trt_col` is a factor, ensure that
`reference_trt` matches a level in the factor. For counterfactual prediction
the reference level is inserted into the newdata frame; if new levels arise,
`allow.new.levels = TRUE` handles them gracefully.

**Composite and MNAR rows that are observed.** The composite strategy only
overwrites rows where `is.na(outcome_col)`. Observed rows with
`miss_flag = "Comp"` (e.g., the actual ICE visit itself, if observed) are
left unchanged.

**Bootstrap sample size.** For the boot-MI approach, `n_boot = 100`–`500`
replicates is typical in practice. The smaller values used in this vignette
are for illustration only.

**Pooling.** With standard MI (`n_boot = 1`), pool regression-model estimates
across imputations using Rubin's rules. With boot-MI (`n_boot > 1`), the
empirical variance across all `n_boot × n_imp` datasets is a valid variance
estimator without additional combining rules.

## Session info

```{r session_info}
sessionInfo()
```