Statistical Foundations
Source: vignettes/articles/statistical_foundations.Rmd
This article provides a comprehensive reference for the statistical
methods implemented in summata. It serves as a technical
companion to the package vignettes, detailing the mathematical
foundations, assumptions, and interpretation of each supported model
class.
For practical guidance on using these methods, see the Regression Modeling vignette.
Notation and Conventions
Throughout this article, we adopt the following notation:
| Symbol | Description |
|---|---|
| Y | Response (outcome) variable |
| X₁, …, Xₚ | Predictor (covariate) variables |
| β₀ | Intercept |
| β₁, …, βₚ | Regression coefficients |
| n | Sample size |
| i | Observation index (i = 1, …, n) |
| j | Cluster index (for hierarchical data) |
| t | Time (for survival models) |
| g(·) | Link function |
| exp(·) | Exponential function |
| log(·) | Natural logarithm |
Hypothesis Testing
Wald Test
Tests H₀: β = 0 using:
\[z = \frac{\hat{\beta}}{\text{SE}(\hat{\beta})}\]
Under H₀, z ~ N(0, 1) for GLM and Cox models. For linear models, the analogous statistic follows a t-distribution with n − p − 1 degrees of freedom.
Confidence Intervals
Wald Confidence Intervals
For coefficient β with standard error SE(β):
\[\beta \pm z_{1-\alpha/2} \cdot \text{SE}(\beta)\]
where z₁₋α/₂ is the standard normal quantile (1.96 for 95% CI). Wald intervals are computationally inexpensive but may be inaccurate when the likelihood surface is asymmetric, which can occur with small samples, sparse data, or estimates near boundary values.
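As a minimal sketch (base R, simulated data), a Wald interval can be assembled directly from the coefficient and its standard error; it matches confint.default(), which uses the same normal approximation:

```r
set.seed(1)
d <- data.frame(y = rbinom(200, 1, 0.4), x = rnorm(200))
fit <- glm(y ~ x, data = d, family = binomial)

beta <- unname(coef(fit)["x"])
se   <- unname(sqrt(diag(vcov(fit)))["x"])   # SE from the estimated covariance matrix
ci   <- beta + c(-1, 1) * qnorm(0.975) * se  # beta +/- 1.96 * SE

# confint.default() applies the identical normal approximation
all.equal(ci, unname(confint.default(fit)["x", ]))
```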
Exact t-Distribution Intervals
For linear models, confidence intervals use the t-distribution to account for estimation of the residual variance:
\[\beta \pm t_{n-p-1, \; 1-\alpha/2} \cdot \text{SE}(\beta)\]
where tₙ₋ₚ₋₁ is the t-quantile with n − p − 1 degrees of freedom. This produces wider intervals than the normal approximation, particularly when n is small relative to p.
Profile Likelihood Intervals
Profile likelihood intervals are based on inverting the likelihood ratio test. The confidence set for β is defined as:
\[\{\beta : 2[\log L(\hat{\beta}) - \log L(\beta)] \leq \chi^2_{1,1-\alpha}\}\]
Profile intervals are more accurate than Wald intervals for small samples or boundary estimates, because they respect the curvature of the likelihood surface rather than approximating it as quadratic. However, they require iterative computation (refitting the model for each parameter), which increases execution time.
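The contrast is easy to see in base R: confint() on a glm profiles the likelihood (delegating to MASS), while confint.default() returns the Wald interval. A small simulated example:

```r
set.seed(2)
d <- data.frame(y = rbinom(60, 1, 0.3), x = rnorm(60))
fit <- glm(y ~ x, data = d, family = binomial)

profile_ci <- suppressMessages(confint(fit))["x", ]  # iterative refitting
wald_ci    <- confint.default(fit)["x", ]            # closed form

# With small n the profile interval is typically asymmetric around beta-hat
rbind(profile = profile_ci, wald = wald_ci)
```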
Confidence Intervals for Effect Measures
For exponentiated coefficients (OR, HR, RR), the CI is obtained by exponentiating the interval endpoints from the coefficient scale:
\[\left(\exp(\text{CI}_{\text{lower}}), \; \exp(\text{CI}_{\text{upper}})\right)\]
where CI_lower and CI_upper are the confidence limits for β on the log scale, computed using whichever method (Wald, exact t, or profile likelihood) is appropriate for the model class. Note that exponentiation preserves coverage probability because exp(·) is a monotone transformation.
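A sketch with simulated data: the OR and its CI are obtained by exponentiating the point estimate and both interval endpoints from the log-odds scale.

```r
set.seed(3)
d <- data.frame(y = rbinom(150, 1, 0.4), x = rnorm(150))
fit <- glm(y ~ x, data = d, family = binomial)

log_ci <- confint.default(fit)["x", ]   # Wald CI on the log-odds scale
or     <- exp(unname(coef(fit)["x"]))   # odds ratio
or_ci  <- exp(log_ci)                   # endpoints map through the monotone exp()
c(OR = or, or_ci)
```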
Implementation
The CI method is selected automatically based on model class, and can
be overridden via the conf_method parameter:
| Model Class | Default (conf_method = "profile") | With conf_method = "wald" |
|---|---|---|
| GLM (binomial, poisson) | Profile likelihood via MASS::confint.glm() | Wald (normal approximation) |
| Negative binomial | Profile likelihood via MASS::confint.glm() | Wald (normal approximation) |
| Quasi-likelihood (quasibinomial, quasipoisson) | Wald (no true likelihood) | Wald (no true likelihood) |
| Linear model (lm) | Exact t-distribution via confint.lm() | Wald (normal approximation) |
| Cox PH (coxph) | Wald | Wald |
| Mixed-effects (lmer, glmer, coxme) | Wald | Wald |
Cox and mixed-effects models use Wald intervals regardless of the
conf_method setting. For Cox models, Wald CIs are the
standard convention in the survival analysis literature. For
mixed-effects models, profile likelihood CIs are available in principle
(via lme4::confint.merMod(method = "profile")), but are
prohibitively slow for routine use and are not implemented in
summata.
The conf_method can be set globally for a session:
options(summata.conf_method = "profile") # Profile/exact-t (default)
options(summata.conf_method = "wald")    # Wald CIs throughout
Linear Regression
Model Specification
Linear regression models the expected value of a continuous response as a linear combination of predictors:
\[E[Y_i | X_i] = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip}\]
Equivalently, in matrix notation:
\[\mathbf{Y} = \mathbf{X}\mathbf{\beta} + \mathbf{\varepsilon}\]
where \(\mathbf{\varepsilon} \sim N(\mathbf{0}, \sigma^2\mathbf{I})\).
Assumptions
- Linearity: The relationship between predictors and response is linear
- Independence: Observations are independent
- Homoscedasticity: Constant error variance across all levels of predictors
- Normality: Errors are normally distributed (for inference)
- No multicollinearity: Predictors are not perfectly correlated
Parameter Estimation
Coefficients are estimated by ordinary least squares (OLS):
\[\hat{\mathbf{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]
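The normal equations can be solved directly, as sketched below for exposition; lm() itself uses a QR decomposition, which is numerically preferable but yields the same estimates:

```r
set.seed(4)
n <- 100
X <- cbind(`(Intercept)` = 1, x1 = rnorm(n), x2 = rnorm(n))  # design matrix
y <- drop(X %*% c(2, 0.5, -1)) + rnorm(n)

# (X'X)^{-1} X'y via the normal equations
beta_hat <- solve(crossprod(X), crossprod(X, y))

fit <- lm(y ~ x1 + x2, data = data.frame(y = y, x1 = X[, "x1"], x2 = X[, "x2"]))
all.equal(drop(beta_hat), coef(fit), check.attributes = FALSE)
```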
Interpretation
Each coefficient βⱼ represents the expected change in Y for a one-unit increase in Xⱼ, holding all other predictors constant.
| Predictor Type | Interpretation of β |
|---|---|
| Continuous | Change in Y per unit increase in X |
| Binary (0/1) | Difference in Y between groups |
| Categorical | Difference from reference category |
Goodness of Fit
Coefficient of determination (R²): \[R^2 = 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2}\]
Adjusted R²: \[R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}\]
Root mean squared error (RMSE): \[\text{RMSE} = \sqrt{\frac{1}{n}\sum_i (Y_i - \hat{Y}_i)^2}\]
Note: this is the descriptive (biased) RMSE. The unbiased estimate of
the residual standard deviation σ uses n − p − 1 in
the denominator and is reported by summary.lm() as the
“residual standard error.”
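The distinction can be verified in a few lines; sigma() extracts the residual standard error that summary.lm() reports:

```r
set.seed(5)
d <- data.frame(x = rnorm(30))
d$y <- 1 + 2 * d$x + rnorm(30)
fit <- lm(y ~ x, data = d)

n <- nrow(d); p <- 1
rmse     <- sqrt(mean(residuals(fit)^2))               # divides by n (descriptive)
resid_se <- sqrt(sum(residuals(fit)^2) / (n - p - 1))  # divides by n - p - 1

all.equal(resid_se, sigma(fit))   # matches the "residual standard error"
```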
Implementation
In summata, linear regression is specified with
model_type = "lm":
fit(data, outcome = "continuous_var", predictors, model_type = "lm")
Generalized Linear Models
Generalized linear models (GLMs) extend linear regression to accommodate non-normal response distributions through three components:
- Random component: Distribution of Y from the exponential family
- Systematic component: Linear predictor \(\eta = \mathbf{X}\mathbf{\beta}\)
- Link function: \(g(\mu) = \eta\), where \(\mu = E[Y]\)
Logistic Regression
Model Specification
For binary outcomes Y ∈ {0, 1}, logistic regression models the log-odds of the event:
\[\log\left(\frac{P(Y_i = 1 | X_i)}{1 - P(Y_i = 1 | X_i)}\right) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}\]
Equivalently: \[P(Y_i = 1 | X_i) = \frac{\exp(\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip})}{1 + \exp(\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip})}\]
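The two forms are connected by the inverse logit, available in base R as plogis():

```r
eta <- 0.5                        # illustrative linear predictor value
p   <- exp(eta) / (1 + exp(eta))  # second form above
all.equal(p, plogis(eta))         # plogis() is the inverse logit
all.equal(log(p / (1 - p)), eta)  # applying the logit recovers the linear predictor
```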
Assumptions
- Binary outcome: Y takes values 0 or 1
- Independence: Observations are independent
- Linearity in log-odds: Log-odds is a linear function of predictors
- No multicollinearity: Predictors are not perfectly correlated
- Large sample: Adequate events per predictor (≥10 rule of thumb)
Interpretation: Odds Ratios
The exponentiated coefficient exp(βⱼ) yields the odds ratio (OR):
\[\text{OR} = \exp(\beta_j) = \frac{\text{Odds}(Y = 1 | X_j + 1)}{\text{Odds}(Y = 1 | X_j)}\]
| OR Value | Interpretation |
|---|---|
| OR = 1 | No association |
| OR > 1 | Increased odds of outcome |
| OR < 1 | Decreased odds of outcome |
For a continuous predictor, the OR represents the multiplicative change in odds per one-unit increase. For a categorical predictor, the OR compares each level to the reference category.
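A sketch with simulated data, covering both predictor types; the exponentiated coefficient for grp compares level B to the reference level A:

```r
set.seed(6)
d <- data.frame(y   = rbinom(200, 1, 0.4),
                x   = rnorm(200),
                grp = factor(sample(c("A", "B"), 200, replace = TRUE)))
fit <- glm(y ~ x + grp, data = d, family = binomial)

or    <- exp(coef(fit))             # per-unit OR for x; grpB is B vs reference A
or_ci <- exp(confint.default(fit))  # Wald CIs, exponentiated
cbind(OR = or, or_ci)
```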
Implementation
fit(data, outcome = "binary_var", predictors, model_type = "glm", family = "binomial")
Poisson Regression
Model Specification
For count outcomes Y ∈ {0, 1, 2, …}, Poisson regression models the log of the expected count:
\[\log(E[Y_i | X_i]) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}\]
Assumptions
- Count outcome: Y is a non-negative integer
- Independence: Observations are independent
- Equidispersion: Mean equals variance (E[Y] = Var(Y))
- Log-linear relationship: Log of expected count is linear in predictors
Interpretation: Rate Ratios
The exponentiated coefficient exp(βⱼ) yields the rate ratio (RR):
\[\text{RR} = \exp(\beta_j) = \frac{E[Y | X_j + 1]}{E[Y | X_j]}\]
Negative Binomial Regression
Model Specification
Negative binomial regression extends Poisson regression with an additional dispersion parameter θ:
\[Y_i \sim \text{NegBin}(\mu_i, \theta)\] \[\log(\mu_i) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}\]
The variance is: \[\text{Var}(Y_i) = \mu_i + \frac{\mu_i^2}{\theta}\]
As θ → ∞, the negative binomial converges to Poisson.
When to Use
- Residual deviance >> residual degrees of freedom in Poisson model
- Variance substantially exceeds mean
- Overdispersion test is significant
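The first check can be scripted directly. Here simulated counts are deliberately overdispersed (negative binomial data fitted with a Poisson model), so the deviance-to-df ratio exceeds 1:

```r
set.seed(7)
d <- data.frame(x = rnorm(200))
mu  <- exp(0.5 + 0.4 * d$x)
d$y <- rnbinom(200, mu = mu, size = 1)   # size is theta; small theta => overdispersion

pois <- glm(y ~ x, data = d, family = poisson)
deviance(pois) / df.residual(pois)       # well above 1 signals overdispersion
```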
Implementation
fit(data, outcome = "count_var", predictors, model_type = "negbin")
Requires the MASS package.
Gamma Regression
Model Specification
For positive, continuous, right-skewed outcomes, Gamma regression models:
\[Y_i \sim \text{Gamma}(\mu_i, \phi)\]
The canonical link for the Gamma family is the inverse (1/μ). In
practice, the log link is often preferred because it yields more
interpretable coefficients (multiplicative effects). When
family = "Gamma" is passed as a string,
summata defaults to the log link. The link can also be
specified explicitly: \[\log(E[Y_i | X_i]) =
\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}\]
Assumptions
- Positive outcome: Y > 0
- Gamma distribution: Right-skewed, variance proportional to mean squared
- Independence: Observations are independent
Interpretation
With log link, exp(βⱼ) represents the multiplicative effect (ratio):
\[\frac{E[Y | X_j + 1]}{E[Y | X_j]} = \exp(\beta_j)\]
Common Applications
- Healthcare costs
- Length of hospital stay
- Biomarker concentrations
- Time durations
Implementation
# String shorthand (resolves to log link)
fit(data, outcome = "positive_continuous", predictors, model_type = "glm", family = "Gamma")
# Explicit link specification
fit(data, outcome = "positive_continuous", predictors, model_type = "glm", family = Gamma(link = "log"))
# Gamma with inverse link (canonical)
fit(data, outcome = "positive_continuous", predictors, model_type = "glm", family = Gamma(link = "inverse"))
Quasi-Likelihood Models
Quasi-binomial and quasi-Poisson families relax the variance assumption of their parent families by estimating a dispersion parameter φ, so that Var(Y) = φ·V(μ). Coefficient estimates are unchanged relative to the parent family, but standard errors are scaled by √φ. Because quasi-likelihood models have no true likelihood, AIC is unavailable and confidence intervals are always Wald-based.
Summary of GLM Families
| Family | Outcome Type | Link | Effect Measure |
|---|---|---|---|
| binomial | Binary | logit | Odds ratio |
| quasibinomial | Binary (overdispersed) | logit | Odds ratio |
| poisson | Count | log | Rate ratio |
| quasipoisson | Count (overdispersed) | log | Rate ratio |
| gaussian | Continuous | identity | Coefficient |
| Gamma | Positive continuous | inverse (canonical); log (in summata) | Ratio (with log link) |
| inverse.gaussian | Positive continuous | 1/μ² | — |
Survival Analysis
Cox Proportional Hazards Model
Model Specification
The Cox model relates the hazard function to predictors:
\[h(t | X_i) = h_0(t) \exp(\beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip})\]
where:
- h(t | X) is the hazard (instantaneous risk) at time t
- h₀(t) is the baseline hazard (unspecified)
- exp(βⱼ) is the hazard ratio for predictor j
Assumptions
- Proportional hazards: Hazard ratios are constant over time
- Independence: Observations are independent (or appropriately modeled)
- Censoring: Non-informative censoring
- No multicollinearity: Predictors are not perfectly correlated
Interpretation: Hazard Ratios
The hazard ratio (HR) = exp(βⱼ) represents the relative hazard:
\[\text{HR} = \frac{h(t | X_j + 1)}{h(t | X_j)} = \exp(\beta_j)\]
| HR Value | Interpretation |
|---|---|
| HR = 1 | No association with survival |
| HR > 1 | Increased hazard (shorter survival) |
| HR < 1 | Decreased hazard (longer survival) |
Goodness of Fit
Concordance (Harrell’s C-index): \[C = P(\hat{h}_i > \hat{h}_j | T_i < T_j)\]
Proportion of concordant pairs; C = 0.5 indicates no discrimination, C = 1 indicates perfect discrimination.
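For example, computing Harrell's C from a Cox model on the survival package's lung data:

```r
library(survival)

fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
concordance(fit)$concordance   # between 0.5 (no discrimination) and 1 (perfect)
```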
Testing Proportional Hazards
The proportional hazards assumption can be tested using:
- Scaled Schoenfeld residuals
- Log-log survival plots
- Time-dependent covariates
Implementation
fit(data, outcome = "Surv(time, status)", predictors, model_type = "coxph")
Requires the survival package.
Conditional Logistic Regression
Model Specification
For matched case-control studies, conditional logistic regression conditions on the stratum:
\[\log\left(\frac{P(Y_i = 1 | X_i, \text{stratum})}{P(Y_i = 0 | X_i, \text{stratum})}\right) = \beta_1 X_{i1} + \cdots + \beta_p X_{ip}\]
Note: No intercept is estimated; it is absorbed by the stratum.
When to Use
- 1:1 or 1:m matched case-control designs
- Stratified analyses where stratum-specific effects are nuisance parameters
Implementation
fit(data, outcome = "case_status", predictors, model_type = "clogit", strata = "match_id")
Requires the survival package.
Mixed-Effects Models
Mixed-effects models (hierarchical or multilevel models) account for correlation within clusters by incorporating random effects.
General Framework
For observation i within cluster j:
\[g(E[Y_{ij} | X_{ij}, u_j]) = \mathbf{X}_{ij}\mathbf{\beta} + \mathbf{Z}_{ij}\mathbf{u}_j\]
where:
- β are fixed effects (common across all clusters)
- uⱼ ~ N(0, Σ) are random effects (cluster-specific deviations)
- Zᵢⱼ is the random effects design matrix
Linear Mixed-Effects Models
Model Specification
For continuous outcomes:
\[Y_{ij} = \beta_0 + \beta_1 X_{1ij} + \cdots + \beta_p X_{pij} + u_{0j} + u_{1j}X_{1ij} + \varepsilon_{ij}\]
where:
- u₀ⱼ ~ N(0, σ²ᵤ₀) is the random intercept for cluster j
- u₁ⱼ ~ N(0, σ²ᵤ₁) is the random slope (if included)
- εᵢⱼ ~ N(0, σ²) is the residual error
Generalized Linear Mixed-Effects Models
Model Specification
GLMMs extend mixed-effects models to non-normal outcomes:
\[g(E[Y_{ij} | X_{ij}, u_j]) = \beta_0 + \beta_1 X_{1ij} + \cdots + \beta_p X_{pij} + u_{0j}\]
Common families:
- Binomial: Clustered binary outcomes
- Poisson: Clustered count outcomes
Cox Mixed-Effects Models
Model Specification
For clustered survival data:
\[h_{ij}(t) = h_0(t) \exp(\beta_1 X_{1ij} + \cdots + \beta_p X_{pij} + u_j)\]
where uⱼ ~ N(0, σ²ᵤ) is the cluster-specific frailty term.
Interaction Effects
Model Specification
An interaction occurs when the effect of one predictor depends on the level of another. For predictors X and Z:
\[g(E[Y]) = \beta_0 + \beta_1 X + \beta_2 Z + \beta_3 (X \times Z)\]
where:
- β₁ is the effect of X when Z = 0
- β₂ is the effect of Z when X = 0
- β₃ is the interaction effect (modification of X effect by Z)
Interpretation
Continuous × Continuous
For continuous X and Z, the effect of X on Y is:
\[\frac{\partial E[Y]}{\partial X} = \beta_1 + \beta_3 Z\]
The effect of X varies linearly with Z.
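A sketch in base R: fit an interaction model to simulated data, then evaluate β₁ + β₃Z at a few values of Z:

```r
set.seed(8)
d <- data.frame(x = rnorm(100), z = rnorm(100))
d$y <- 1 + 0.5 * d$x - 0.3 * d$z + 0.8 * d$x * d$z + rnorm(100)
fit <- lm(y ~ x * z, data = d)         # expands to x + z + x:z

b <- coef(fit)
z_vals  <- c(-1, 0, 1)
slope_x <- b["x"] + b["x:z"] * z_vals  # beta1 + beta3 * z
setNames(slope_x, paste0("z = ", z_vals))
```

At z = 0 the slope of x reduces to β₁, matching the interpretation above.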
Testing Interactions
- Wald test: Test H₀: β₃ = 0
- Likelihood ratio test: Compare models with and without interaction
- Model comparison: Use compfit() to compare AIC/BIC
Model Quality Metrics
Information Criteria
Akaike Information Criterion (AIC)
\[\text{AIC} = -2\log L + 2k\]
where L is the maximized likelihood and k is the number of parameters. AIC balances fit and complexity; lower values indicate better models.
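The formula can be checked by hand against R's AIC(); for a linear model, k counts the two coefficients plus the residual variance:

```r
set.seed(9)
d <- data.frame(x = rnorm(100))
d$y <- 2 + 0.5 * d$x + rnorm(100)
fit <- lm(y ~ x, data = d)

k   <- attr(logLik(fit), "df")              # beta0, beta1, and sigma^2
aic <- -2 * as.numeric(logLik(fit)) + 2 * k
all.equal(aic, AIC(fit))
```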
Discrimination Metrics
Explained Variation
R² (Linear Models)
\[R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}\]
Proportion of variance explained.
Pseudo-R² (GLMs)
McFadden pseudo-R²: \[R^2_{\text{McF}} = 1 - \frac{\log L_{\text{full}}}{\log L_{\text{null}}}\]
Nagelkerke pseudo-R²: \[R^2_N = \frac{1 - (L_{\text{null}}/L_{\text{full}})^{2/n}}{1 - L_{\text{null}}^{2/n}}\]
Values are not directly comparable to linear R²; 0.2–0.4 typically indicates good fit for logistic models.
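McFadden's pseudo-R² can be computed from the log-likelihoods of the fitted and intercept-only models, as in this simulated sketch:

```r
set.seed(10)
d <- data.frame(x = rnorm(300))
d$y <- rbinom(300, 1, plogis(-0.5 + 1.2 * d$x))

full <- glm(y ~ x, data = d, family = binomial)
null <- glm(y ~ 1, data = d, family = binomial)  # intercept-only reference

1 - as.numeric(logLik(full)) / as.numeric(logLik(null))  # McFadden pseudo-R^2
```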
Composite Model Score (CMS)
Rationale
When comparing multiple candidate models, researchers typically examine several quality metrics: information criteria (AIC, BIC), discrimination (concordance), calibration (Brier score), and explained variation (R² or pseudo-R²). Each metric captures a different aspect of model performance, and no single measure is universally optimal for model selection.
The Composite Model Score (CMS) addresses this challenge by combining multiple metrics into a single summary value, facilitating rapid comparison across candidate models. This approach:
- Synthesizes complementary information from diverse metrics
- Provides model-type-specific weighting based on statistical conventions
- Normalizes metrics to a common 0–100 scale for interpretability
- Accounts for convergence issues that may invalidate model results
The CMS is intended as a screening tool to identify promising models, not as a replacement for substantive evaluation. Final model selection should always consider domain knowledge, parsimony, and the specific research objectives.
Methodology
Normalization
Each component metric is transformed to a 0–100 scale where higher values indicate better performance:
Information criteria (AIC, BIC) — Lower is better: \[S_{\text{AIC}} = 100 \times \left(1 - \frac{\text{AIC}_i - \text{AIC}_{\min}}{\text{AIC}_{\max} - \text{AIC}_{\min}}\right)\]
Concordance (C-statistic) — Higher is better, with 0.5 as floor: \[S_C = \begin{cases} 0 & C \leq 0.5 \\ 200 \times (C - 0.5) & C > 0.5 \end{cases}\]
Pseudo-R² (McFadden) — Scaled recognizing typical range: \[S_{R^2} = \min(100, \; 250 \times R^2_{\text{McF}})\]
This scaling reflects that McFadden’s R² rarely exceeds 0.4 even for well-fitting models.
Brier score — Lower is better, with 0.25 as no-skill baseline: \[S_{\text{Brier}} = 100 \times \left(1 - \frac{\text{Brier}}{0.25}\right)\]
Convergence — Categorical assessment: \[S_{\text{conv}} = \begin{cases} 100 & \text{converged normally} \\ 70 & \text{suspect (warnings)} \\ 30 & \text{failed to converge} \end{cases}\]
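The metric normalizations can be sketched as small helper functions (illustrative only, not summata's internal API), each mapping a vector of candidate-model metrics onto the 0–100 scale defined above:

```r
# Illustrative implementations of the normalization formulas above
score_aic         <- function(aic) 100 * (1 - (aic - min(aic)) / (max(aic) - min(aic)))
score_concordance <- function(C)   ifelse(C <= 0.5, 0, 200 * (C - 0.5))
score_brier       <- function(b)   100 * (1 - b / 0.25)
score_pseudo_r2   <- function(r2)  pmin(100, 250 * r2)

score_aic(c(210, 204, 200))        # worst model -> 0, best -> 100
score_concordance(c(0.45, 0.75))   # at-or-below-chance floors to 0
score_brier(0.125)
score_pseudo_r2(0.3)
```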
Weighting
Component scores are combined using model-type-specific weights that reflect the relative importance of each metric for that model class:
| Model Type | Convergence | AIC | Concordance | Pseudo-R² | Brier | Global p |
|---|---|---|---|---|---|---|
| GLM (logistic) | 0.15 | 0.25 | 0.40 | 0.15 | 0.05 | — |
| Cox PH | 0.15 | 0.30 | 0.40 | — | — | 0.15 |
| Linear | 0.15 | 0.25 | — | 0.45 | — | 0.15 |
For mixed-effects models, additional components include marginal R², conditional R², and ICC.
Interpretation
| CMS Range | Interpretation |
|---|---|
| 85–100 | Excellent model quality |
| 75–84 | Very good |
| 65–74 | Good |
| 55–64 | Fair |
| < 55 | Poor |
These thresholds provide general guidance. A model with CMS = 80 is not necessarily “better” than one with CMS = 78—small differences should prompt examination of individual metrics and substantive considerations.
Limitations
- Relative ranking: CMS values are relative to the set of models being compared; they do not indicate absolute model quality
- Weight sensitivity: Results depend on the chosen weights; users with strong priors about metric importance can supply custom weights
- Metric availability: Not all metrics are available for all model types; weights are renormalized when components are missing
- No substitute for judgment: The CMS facilitates screening but does not replace careful model diagnostics and domain expertise
Implementation
# Compare candidate models
comparison <- compfit(
data = mydata,
outcome = "y",
predictors = list(
"Model 1" = c("x1", "x2"),
"Model 2" = c("x1", "x2", "x3"),
"Model 3" = c("x1", "x2", "x3", "x4")
),
model_type = "logistic"
)
# Custom weights emphasizing discrimination
comparison_custom <- compfit(
data = mydata,
outcome = "y",
predictors = model_list,
model_type = "logistic",
scoring_weights = list(
convergence = 0.10,
aic = 0.20,
concordance = 0.50,
pseudo_r2 = 0.15,
brier = 0.05
)
)
References
Methodological
- Agresti, A. (2013). Categorical Data Analysis (3rd ed.). Wiley.
- Harrell, F. E. (2015). Regression Modeling Strategies (2nd ed.). Springer.
- Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.
- Therneau, T. M., & Grambsch, P. M. (2000). Modeling Survival Data. Springer.
- Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Software
- R Core Team (2025). R: A Language and Environment for Statistical Computing. https://www.R-project.org/
- Therneau, T. (2025). survival: Survival Analysis. R package. https://CRAN.R-project.org/package=survival
- Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.
- Therneau, T. (2025). coxme: Cox Mixed-Effects Models. R package. https://CRAN.R-project.org/package=coxme
- Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). Springer. (MASS package)
See Also
- Regression Modeling: Practical examples using fit(), uniscreen(), and fullfit()
- Advanced Workflows: Interactions and mixed-effects models
- Model Comparison: Comparing alternative specifications with compfit()
- Forest Plots: Visualization of regression results