Statistical Foundations
Source: vignettes/articles/statistical_foundations.Rmd
This article provides a comprehensive reference for the statistical
methods implemented in summata. It serves as a technical
companion to the package vignettes, detailing the mathematical
foundations, assumptions, and interpretation of each supported model
class.
For practical guidance on using these methods, see the Regression Modeling vignette.
Notation and Conventions
Throughout this article, we adopt the following notation:
| Symbol | Description |
|---|---|
| Y | Response (outcome) variable |
| X₁, …, Xₚ | Predictor (covariate) variables |
| β₀ | Intercept |
| β₁, …, βₚ | Regression coefficients |
| n | Sample size |
| i | Observation index (i = 1, …, n) |
| j | Cluster index (for hierarchical data) |
| t | Time (for survival models) |
| g(·) | Link function |
| exp(·) | Exponential function |
| log(·) | Natural logarithm |
Hypothesis Testing
Wald Test
Tests H₀: β = 0 using:
\[z = \frac{\hat{\beta}}{\text{SE}(\hat{\beta})}\]
Under H₀, z ~ N(0, 1).
Confidence Intervals
Wald Confidence Intervals
For estimated coefficient \(\hat{\beta}\) with standard error \(\text{SE}(\hat{\beta})\):
\[\hat{\beta} \pm z_{1-\alpha/2} \cdot \text{SE}(\hat{\beta})\]
where z₁₋α/₂ is the standard normal quantile (1.96 for 95% CI).
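As a concrete sketch, both the Wald statistic and the Wald interval can be computed by hand from any fitted model's coefficient table (illustrative base-R example on the built-in mtcars data, not summata output):

```r
# Wald statistic and approximate 95% CI computed by hand from a base-R fit
fit <- lm(mpg ~ wt + hp, data = mtcars)
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]

z  <- est / se                           # Wald statistic for H0: beta = 0
ci <- cbind(lower = est - 1.96 * se,     # approximate 95% Wald interval
            upper = est + 1.96 * se)
```

For a linear model the same ratio appears in the "t value" column of summary(); for GLMs it is reported as a z value.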
Linear Regression
Model Specification
Linear regression models the expected value of a continuous response as a linear combination of predictors:
\[E[Y_i | X_i] = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip}\]
Equivalently, in matrix notation:
\[\mathbf{Y} = \mathbf{X}\mathbf{\beta} + \mathbf{\varepsilon}\]
where \(\mathbf{\varepsilon} \sim N(\mathbf{0}, \sigma^2\mathbf{I})\).
Assumptions
- Linearity: The relationship between predictors and response is linear
- Independence: Observations are independent
- Homoscedasticity: Constant error variance across all levels of predictors
- Normality: Errors are normally distributed (for inference)
- No multicollinearity: Predictors are not perfectly correlated
Parameter Estimation
Coefficients are estimated by ordinary least squares (OLS):
\[\hat{\mathbf{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]
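The closed-form estimator can be verified directly against lm() (illustrative sketch using mtcars):

```r
# OLS via the normal equations, checked against lm()
X <- cbind(1, mtcars$wt, mtcars$hp)   # design matrix with intercept column
Y <- mtcars$mpg
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)

fit <- lm(mpg ~ wt + hp, data = mtcars)  # same estimates via lm()
```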
Interpretation
Each coefficient βⱼ represents the expected change in Y for a one-unit increase in Xⱼ, holding all other predictors constant.
| Predictor Type | Interpretation of β |
|---|---|
| Continuous | Change in Y per unit increase in X |
| Binary (0/1) | Difference in Y between groups |
| Categorical | Difference from reference category |
Goodness of Fit
Coefficient of determination (R²): \[R^2 = 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2}\]
Adjusted R²: \[R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}\]
Root mean squared error (RMSE): \[\text{RMSE} = \sqrt{\frac{1}{n}\sum_i (Y_i - \hat{Y}_i)^2}\]
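These three quantities follow directly from the residuals; a base-R sketch (illustrative, using mtcars):

```r
# R-squared, adjusted R-squared, and RMSE computed from residuals
fit  <- lm(mpg ~ wt + hp, data = mtcars)
y    <- mtcars$mpg
yhat <- fitted(fit)
n <- length(y); p <- 2                 # n observations, p predictors

r2     <- 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
rmse   <- sqrt(mean((y - yhat)^2))
```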
Implementation
In summata, linear regression is specified with
model_type = "lm":
fit(data, outcome = "continuous_var", predictors, model_type = "lm")
Generalized Linear Models
Generalized linear models (GLMs) extend linear regression to accommodate non-normal response distributions through three components:
- Random component: Distribution of Y from the exponential family
- Systematic component: Linear predictor \(\eta = \mathbf{X}\mathbf{\beta}\)
- Link function: \(g(\mu) = \eta\), where \(\mu = E[Y]\)
Logistic Regression
Model Specification
For binary outcomes Y ∈ {0, 1}, logistic regression models the log-odds of the event:
\[\log\left(\frac{P(Y_i = 1 | X_i)}{1 - P(Y_i = 1 | X_i)}\right) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}\]
Equivalently: \[P(Y_i = 1 | X_i) = \frac{\exp(\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip})}{1 + \exp(\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip})}\]
Assumptions
- Binary outcome: Y takes values 0 or 1
- Independence: Observations are independent
- Linearity in log-odds: Log-odds is a linear function of predictors
- No multicollinearity: Predictors are not perfectly correlated
- Large sample: Adequate events per predictor (≥10 rule of thumb)
Interpretation: Odds Ratios
The exponentiated coefficient exp(βⱼ) yields the odds ratio (OR):
\[\text{OR} = \exp(\beta_j) = \frac{\text{Odds}(Y = 1 | X_j + 1)}{\text{Odds}(Y = 1 | X_j)}\]
| OR Value | Interpretation |
|---|---|
| OR = 1 | No association |
| OR > 1 | Increased odds of outcome |
| OR < 1 | Decreased odds of outcome |
For a continuous predictor, the OR represents the multiplicative change in odds per one-unit increase. For a categorical predictor, the OR compares each level to the reference category.
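Odds ratios and their Wald intervals are obtained by exponentiating the coefficients and CI limits; a base-R sketch (mtcars$am is a 0/1 transmission indicator, used here only for illustration):

```r
# Odds ratios and Wald CIs from a base-R logistic fit
fit   <- glm(am ~ wt + hp, data = mtcars, family = binomial)
or    <- exp(coef(fit))               # odds ratios
or_ci <- exp(confint.default(fit))    # Wald CI on the log-odds scale, exponentiated
```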
Implementation
fit(data, outcome = "binary_var", predictors, model_type = "glm", family = "binomial")
Poisson Regression
Model Specification
For count outcomes Y ∈ {0, 1, 2, …}, Poisson regression models the log of the expected count:
\[\log(E[Y_i | X_i]) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}\]
Assumptions
- Count outcome: Y is a non-negative integer
- Independence: Observations are independent
- Equidispersion: Mean equals variance (E[Y] = Var(Y))
- Log-linear relationship: Log of expected count is linear in predictors
Interpretation: Rate Ratios
The exponentiated coefficient exp(βⱼ) yields the rate ratio (RR):
\[\text{RR} = \exp(\beta_j) = \frac{E[Y | X_j + 1]}{E[Y | X_j]}\]
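As with odds ratios, rate ratios come from exponentiating the coefficients; an illustrative base-R sketch on the built-in InsectSprays data:

```r
# Rate ratios from a Poisson fit (spray A is the reference level)
fit <- glm(count ~ spray, data = InsectSprays, family = poisson)
rr  <- exp(coef(fit))   # rate ratios relative to the reference spray
```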
Negative Binomial Regression
Model Specification
Negative binomial regression extends Poisson regression with an additional dispersion parameter θ:
\[Y_i \sim \text{NegBin}(\mu_i, \theta)\] \[\log(\mu_i) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}\]
The variance is: \[\text{Var}(Y_i) = \mu_i + \frac{\mu_i^2}{\theta}\]
As θ → ∞, the negative binomial converges to Poisson.
When to Use
- Residual deviance >> residual degrees of freedom in Poisson model
- Variance substantially exceeds mean
- Overdispersion test is significant
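The first criterion can be checked with a quick base-R diagnostic: the ratio of residual deviance to residual degrees of freedom should be near 1 under equidispersion (illustrative sketch, not summata's internal check):

```r
# Dispersion statistic for a Poisson fit; values well above 1
# suggest switching to a negative binomial model
fit  <- glm(count ~ spray, data = InsectSprays, family = poisson)
disp <- deviance(fit) / df.residual(fit)
```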
Implementation
fit(data, outcome = "count_var", predictors, model_type = "negbin")
Requires the MASS package.
Gamma Regression
Model Specification
For positive, continuous, right-skewed outcomes, Gamma regression models:
\[Y_i \sim \text{Gamma}(\mu_i, \phi)\]
With log link (default): \[\log(E[Y_i | X_i]) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}\]
Assumptions
- Positive outcome: Y > 0
- Gamma distribution: Right-skewed, variance proportional to mean squared
- Independence: Observations are independent
Interpretation
With log link, exp(βⱼ) represents the multiplicative effect (ratio):
\[\frac{E[Y | X_j + 1]}{E[Y | X_j]} = \exp(\beta_j)\]
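A base-R sketch of the ratio interpretation (illustrative; iris sepal length is simply a convenient positive continuous outcome):

```r
# Log-link Gamma fit; exponentiated coefficients are multiplicative effects
fit <- glm(Sepal.Length ~ Petal.Length, data = iris,
           family = Gamma(link = "log"))
ratio <- exp(coef(fit))   # ratio of expected outcome per unit increase
```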
Common Applications
- Healthcare costs
- Length of hospital stay
- Biomarker concentrations
- Time durations
Implementation
fit(data, outcome = "positive_continuous", predictors, model_type = "glm", family = "Gamma")
Quasi-Likelihood Models
Quasi-likelihood families (quasibinomial, quasipoisson) retain the mean–variance structure of their parent families but estimate a dispersion parameter from the data, inflating standard errors when overdispersion is present.
Summary of GLM Families
| Family | Outcome Type | Link | Effect Measure |
|---|---|---|---|
| binomial | Binary | logit | Odds ratio |
| quasibinomial | Binary (overdispersed) | logit | Odds ratio |
| poisson | Count | log | Rate ratio |
| quasipoisson | Count (overdispersed) | log | Rate ratio |
| gaussian | Continuous | identity | Coefficient |
| Gamma | Positive continuous | log | Ratio |
| inverse.gaussian | Positive continuous | 1/μ² | — |
Survival Analysis
Cox Proportional Hazards Model
Model Specification
The Cox model relates the hazard function to predictors:
\[h(t | X_i) = h_0(t) \exp(\beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip})\]
where:
- h(t | X) is the hazard (instantaneous risk) at time t
- h₀(t) is the baseline hazard (unspecified)
- exp(βⱼ) is the hazard ratio for predictor j
Assumptions
- Proportional hazards: Hazard ratios are constant over time
- Independence: Observations are independent (or appropriately modeled)
- Censoring: Non-informative censoring
- No multicollinearity: Predictors are not perfectly correlated
Interpretation: Hazard Ratios
The hazard ratio (HR) = exp(βⱼ) represents the relative hazard:
\[\text{HR} = \frac{h(t | X_j + 1)}{h(t | X_j)} = \exp(\beta_j)\]
| HR Value | Interpretation |
|---|---|
| HR = 1 | No association with survival |
| HR > 1 | Increased hazard (shorter survival) |
| HR < 1 | Decreased hazard (longer survival) |
Goodness of Fit
Concordance (Harrell’s C-index): \[C = P(\hat{h}_i > \hat{h}_j | T_i < T_j)\]
Proportion of concordant pairs; C = 0.5 indicates no discrimination, C = 1 indicates perfect discrimination.
Testing Proportional Hazards
The proportional hazards assumption can be tested using:
- Scaled Schoenfeld residuals
- Log-log survival plots
- Time-dependent covariates
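With the survival package, the first approach is a one-liner (sketch using the package's bundled lung data; a small p-value indicates evidence against proportional hazards):

```r
# Scaled Schoenfeld residual test of the proportional hazards assumption
library(survival)
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
zp  <- cox.zph(fit)   # per-covariate and GLOBAL chi-square tests
```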
Implementation
fit(data, outcome = "Surv(time, status)", predictors, model_type = "coxph")
Requires the survival package.
Conditional Logistic Regression
Model Specification
For matched case-control studies, conditional logistic regression conditions on the stratum:
\[\log\left(\frac{P(Y_i = 1 | X_i, \text{stratum})}{P(Y_i = 0 | X_i, \text{stratum})}\right) = \beta_1 X_{i1} + \cdots + \beta_p X_{ip}\]
Note: No intercept is estimated; it is absorbed by the stratum.
When to Use
- 1:1 or 1:m matched case-control designs
- Stratified analyses where stratum-specific effects are nuisance parameters
Implementation
fit(data, outcome = "case_status", predictors, model_type = "clogit", strata = "match_id")
Requires the survival package.
Mixed-Effects Models
Mixed-effects models (hierarchical or multilevel models) account for correlation within clusters by incorporating random effects.
General Framework
For observation i within cluster j:
\[g(E[Y_{ij} | X_{ij}, u_j]) = \mathbf{X}_{ij}\mathbf{\beta} + \mathbf{Z}_{ij}\mathbf{u}_j\]
where:
- β are fixed effects (population-averaged)
- uⱼ ~ N(0, Σ) are random effects (cluster-specific deviations)
- Zᵢⱼ is the random effects design matrix
Linear Mixed-Effects Models
Model Specification
For continuous outcomes:
\[Y_{ij} = \beta_0 + \beta_1 X_{1ij} + \cdots + \beta_p X_{pij} + u_{0j} + u_{1j}X_{1ij} + \varepsilon_{ij}\]
where:
- u₀ⱼ ~ N(0, σ²ᵤ₀) is the random intercept for cluster j
- u₁ⱼ ~ N(0, σ²ᵤ₁) is the random slope (if included)
- εᵢⱼ ~ N(0, σ²) is the residual error
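A random-intercept fit can be sketched with nlme, which ships with R (illustrative, using nlme's bundled Orthodont data); the lme4 equivalent would be lmer(distance ~ age + (1 | Subject), data = Orthodont):

```r
# Random-intercept linear mixed model: children (Subject) are clusters
library(nlme)
fit <- lme(distance ~ age, random = ~ 1 | Subject, data = Orthodont)

fixef_est <- fixef(fit)   # population-averaged (fixed) effects
# Standard deviation of the random intercepts, sigma_u0
sigma_u <- as.numeric(VarCorr(fit)["(Intercept)", "StdDev"])
```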
Generalized Linear Mixed-Effects Models
Model Specification
GLMMs extend mixed-effects models to non-normal outcomes:
\[g(E[Y_{ij} | X_{ij}, u_j]) = \beta_0 + \beta_1 X_{1ij} + \cdots + \beta_p X_{pij} + u_{0j}\]
Common families:
- Binomial: Clustered binary outcomes
- Poisson: Clustered count outcomes
Cox Mixed-Effects Models
Model Specification
For clustered survival data:
\[h_{ij}(t) = h_0(t) \exp(\beta_1 X_{1ij} + \cdots + \beta_p X_{pij} + u_j)\]
where uⱼ ~ N(0, σ²ᵤ) is the cluster-specific frailty term.
Interaction Effects
Model Specification
An interaction occurs when the effect of one predictor depends on the level of another. For predictors X and Z:
\[g(E[Y]) = \beta_0 + \beta_1 X + \beta_2 Z + \beta_3 (X \times Z)\]
where:
- β₁ is the effect of X when Z = 0
- β₂ is the effect of Z when X = 0
- β₃ is the interaction effect (modification of X effect by Z)
Interpretation
Continuous × Continuous
For continuous X and Z, the effect of X on Y is:
\[\frac{\partial E[Y]}{\partial X} = \beta_1 + \beta_3 Z\]
The effect of X varies linearly with Z.
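This conditional effect can be evaluated at representative values of the moderator; a base-R sketch (illustrative, with wt as X and hp as Z):

```r
# Marginal effect of wt on mpg at the quartiles of hp: beta1 + beta3 * hp
fit <- lm(mpg ~ wt * hp, data = mtcars)
b   <- coef(fit)

hp_vals   <- quantile(mtcars$hp, c(0.25, 0.50, 0.75))
effect_wt <- b["wt"] + b["wt:hp"] * hp_vals
```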
Testing Interactions
- Wald test: Test H₀: β₃ = 0
- Likelihood ratio test: Compare models with and without interaction
- Model comparison: Use compfit() to compare AIC/BIC
Model Quality Metrics
Information Criteria
Akaike Information Criterion (AIC)
\[\text{AIC} = -2\log L + 2k\]
where L is the maximized likelihood and k is the number of parameters. AIC balances fit and complexity; lower values indicate better models.
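In base R, AIC values for competing fits can be tabulated directly (illustrative sketch):

```r
# AIC comparison of two candidate models; lower = better trade-off
fit1 <- lm(mpg ~ wt,      data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)
aic_tab <- AIC(fit1, fit2)
```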
Discrimination Metrics
Explained Variation
R² (Linear Models)
\[R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}\]
Proportion of variance explained.
Pseudo-R² (GLMs)
McFadden’s pseudo-R²: \[R^2_{\text{McF}} = 1 - \frac{\log L_{\text{full}}}{\log L_{\text{null}}}\]
Nagelkerke’s pseudo-R²: \[R^2_N = \frac{1 - (L_{\text{null}}/L_{\text{full}})^{2/n}}{1 - L_{\text{null}}^{2/n}}\]
Values are not directly comparable to linear R²; 0.2–0.4 typically indicates good fit for logistic models.
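McFadden's measure needs only the log-likelihoods of the fitted and intercept-only models; a base-R sketch (illustrative, reusing mtcars$am as a binary outcome):

```r
# McFadden's pseudo-R-squared from two log-likelihoods
fit  <- glm(am ~ wt + hp, data = mtcars, family = binomial)
null <- glm(am ~ 1,       data = mtcars, family = binomial)

r2_mcf <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))
```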
Composite Model Score (CMS)
Rationale
When comparing multiple candidate models, researchers typically examine several quality metrics: information criteria (AIC, BIC), discrimination (concordance), calibration (Brier score), and explained variation (R² or pseudo-R²). Each metric captures a different aspect of model performance, and no single measure is universally optimal for model selection.
The Composite Model Score (CMS) addresses this challenge by combining multiple metrics into a single summary value, facilitating rapid comparison across candidate models. This approach:
- Synthesizes complementary information from diverse metrics
- Provides model-type-specific weighting based on statistical conventions
- Normalizes metrics to a common 0–100 scale for interpretability
- Accounts for convergence issues that may invalidate model results
The CMS is intended as a screening tool to identify promising models, not as a replacement for substantive evaluation. Final model selection should always consider domain knowledge, parsimony, and the specific research objectives.
Methodology
Normalization
Each component metric is transformed to a 0–100 scale where higher values indicate better performance:
Information criteria (AIC, BIC) — Lower is better: \[S_{\text{AIC}} = 100 \times \left(1 - \frac{\text{AIC}_i - \text{AIC}_{\min}}{\text{AIC}_{\max} - \text{AIC}_{\min}}\right)\]
Concordance (C-statistic) — Higher is better, with 0.5 as floor: \[S_C = \begin{cases} 0 & C \leq 0.5 \\ 200 \times (C - 0.5) & C > 0.5 \end{cases}\]
Pseudo-R² (McFadden) — Higher is better, scaled to its typical range: \[S_{R^2} = \min(100, \; 250 \times R^2_{\text{McF}})\]
This scaling reflects that McFadden’s R² rarely exceeds 0.4 even for well-fitting models.
Brier score — Lower is better, with 0.25 as no-skill baseline: \[S_{\text{Brier}} = 100 \times \left(1 - \frac{\text{Brier}}{0.25}\right)\]
Convergence — Categorical assessment: \[S_{\text{conv}} = \begin{cases} 100 & \text{converged normally} \\ 70 & \text{suspect (warnings)} \\ 30 & \text{failed to converge} \end{cases}\]
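The transformations above can be sketched as small helper functions (names are illustrative, not summata's internal API):

```r
# Min-max rescaling of AIC across the candidate set (lower AIC -> higher score)
scale_aic   <- function(aic) 100 * (1 - (aic - min(aic)) / diff(range(aic)))
# Concordance: 0 at or below the 0.5 no-discrimination floor, 100 at C = 1
scale_conc  <- function(C) ifelse(C <= 0.5, 0, 200 * (C - 0.5))
# McFadden pseudo-R-squared, capped at 100
scale_r2mcf <- function(r2) pmin(100, 250 * r2)
# Brier score against the 0.25 no-skill baseline
scale_brier <- function(b) 100 * (1 - b / 0.25)
```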
Weighting
Component scores are combined using model-type-specific weights that reflect the relative importance of each metric for that model class:
| Model Type | Convergence | AIC | Concordance | Pseudo-R² | Brier | Global p |
|---|---|---|---|---|---|---|
| GLM (logistic) | 0.15 | 0.25 | 0.40 | 0.15 | 0.05 | — |
| Cox PH | 0.15 | 0.30 | 0.40 | — | — | 0.15 |
| Linear | 0.15 | 0.25 | — | 0.45 | — | 0.15 |
For mixed-effects models, additional components include marginal R², conditional R², and ICC.
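A minimal sketch of the weighted combination, including the renormalization applied when a metric is unavailable (function name illustrative, not summata's API):

```r
# Weighted mean of component scores; weights are renormalized over
# the metrics that are actually available (non-NA)
cms_score <- function(scores, weights) {
  ok <- !is.na(scores)
  sum(scores[ok] * weights[ok]) / sum(weights[ok])
}

# Example: logistic-model weights with the Brier component missing
cms <- cms_score(c(conv = 100, aic = 80, conc = 60, r2 = 40, brier = NA),
                 c(conv = 0.15, aic = 0.25, conc = 0.40, r2 = 0.15, brier = 0.05))
```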
Interpretation
| CMS Range | Interpretation |
|---|---|
| 85–100 | Excellent model quality |
| 75–84 | Very good |
| 65–74 | Good |
| 55–64 | Fair |
| < 55 | Poor |
These thresholds provide general guidance. A model with CMS = 80 is not necessarily “better” than one with CMS = 78—small differences should prompt examination of individual metrics and substantive considerations.
Limitations
- Relative ranking: CMS values are relative to the set of models being compared; they do not indicate absolute model quality
- Weight sensitivity: Results depend on the chosen weights; users with strong priors about metric importance can supply custom weights
- Metric availability: Not all metrics are available for all model types; weights are renormalized when components are missing
- No substitute for judgment: The CMS facilitates screening but does not replace careful model diagnostics and domain expertise
Implementation
# Compare candidate models
comparison <- compfit(
data = mydata,
outcome = "y",
predictors = list(
"Model 1" = c("x1", "x2"),
"Model 2" = c("x1", "x2", "x3"),
"Model 3" = c("x1", "x2", "x3", "x4")
),
model_type = "logistic"
)
# Custom weights emphasizing discrimination
comparison_custom <- compfit(
data = mydata,
outcome = "y",
predictors = model_list,
model_type = "logistic",
scoring_weights = list(
convergence = 0.10,
aic = 0.20,
concordance = 0.50,
pseudo_r2 = 0.15,
brier = 0.05
)
)
References
Methodological
- Agresti, A. (2013). Categorical Data Analysis (3rd ed.). Wiley.
- Harrell, F. E. (2015). Regression Modeling Strategies (2nd ed.). Springer.
- Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.
- Therneau, T. M., & Grambsch, P. M. (2000). Modeling Survival Data: Extending the Cox Model. Springer.
- Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Software
- R Core Team (2024). R: A Language and Environment for Statistical Computing. https://www.R-project.org/
- Therneau, T. (2024). survival: Survival Analysis. R package. https://CRAN.R-project.org/package=survival
- Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.
- Therneau, T. (2024). coxme: Cox Mixed-Effects Models. R package. https://CRAN.R-project.org/package=coxme
- Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). Springer. (MASS package)
See Also
- Regression Modeling: Practical examples using fit(), uniscreen(), and fullfit()
- Advanced Workflows: Interactions and mixed-effects models
- Model Comparison: Comparing alternative specifications with compfit()
- Forest Plots: Visualization of regression results