Create Forest Plot for Generalized Linear Models

Generates a publication-ready forest plot that combines a formatted data table with a graphical representation of effect estimates (odds ratios, risk ratios, or coefficients) from a generalized linear model. The plot integrates variable names, group levels, sample sizes, effect estimates with confidence intervals, p-values, and model diagnostics in a single comprehensive visualization designed for manuscripts and presentations.

Usage

glmforest(
  x,
  data = NULL,
  title = "Generalized Linear Model",
  effect_label = NULL,
  digits = 2,
  p_digits = 3,
  conf_level = 0.95,
  font_size = 1,
  annot_size = 3.88,
  header_size = 5.82,
  title_size = 23.28,
  plot_width = NULL,
  plot_height = NULL,
  table_width = 0.6,
  show_n = TRUE,
  show_events = TRUE,
  indent_groups = FALSE,
  condense_table = FALSE,
  bold_variables = FALSE,
  center_padding = 4,
  zebra_stripes = TRUE,
  ref_label = "reference",
  labels = NULL,
  color = NULL,
  exponentiate = NULL,
  qc_footer = TRUE,
  units = "in",
  number_format = NULL
)

Arguments

x

Either a fitted GLM object (class glm or glmerMod), a fit_result object from fit(), or a fullfit_result object from fullfit(). When a fit_result or fullfit_result is provided, the model, data, and labels are automatically extracted.

data

Data frame or data.table containing the original data used to fit the model. If NULL (default) and x is a model, the function attempts to extract data from the model object. If x is a fit_result, data is extracted automatically. Providing data explicitly is recommended when passing a model directly.

title

Character string specifying the plot title displayed at the top. Default is "Generalized Linear Model". Use descriptive titles like "Risk Factors for Disease Outcome" for publication.

effect_label

Character string for the effect measure label on the forest plot axis. If NULL (default), automatically determined based on model family and link function: "Odds Ratio" for logistic regression (family = binomial, link = logit), "Risk Ratio" for log-link models, "Exp(Coefficient)" for other exponential families, or "Coefficient" for identity link.

digits

Integer specifying the number of decimal places for effect estimates and confidence intervals in the data table. Default is 2.

p_digits

Integer specifying the number of decimal places for p-values. Values smaller than 10^(-p_digits) are displayed as "< 0.001" (for p_digits = 3), "< 0.0001" (for p_digits = 4), etc. Default is 3.
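The thresholding rule can be illustrated in base R. This is a minimal sketch of the documented behavior, not the package's internal code; the helper name format_p is hypothetical.

```r
# Hypothetical helper illustrating the p_digits rule; not part of the package.
format_p <- function(p, p_digits = 3) {
  threshold <- 10^(-p_digits)
  ifelse(
    p < threshold,
    # Below the threshold: report "< threshold" instead of a rounded zero
    paste0("< ", formatC(threshold, digits = p_digits, format = "f")),
    # Otherwise: fixed-point value with p_digits decimal places
    formatC(p, digits = p_digits, format = "f")
  )
}

format_p(c(0.04567, 0.00004))                # "0.046"  "< 0.001"
format_p(c(0.04567, 0.00004), p_digits = 4)  # "0.0457" "< 0.0001"
```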

conf_level

Numeric confidence level for confidence intervals. Must be between 0 and 1. Default is 0.95 (95% confidence intervals). The CI percentage is automatically displayed in column headers (e.g., "90% CI" when conf_level = 0.90).
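As a rough illustration of how conf_level maps to an interval and a header label, the sketch below uses base R with the built-in mtcars data. Wald intervals via confint.default() are an assumption made for the example; the package may compute its intervals differently.

```r
# Wald intervals are an assumption here; the package may compute CIs differently.
fit <- glm(am ~ wt, data = mtcars, family = binomial)
conf_level <- 0.90

or_est <- exp(coef(fit))                                  # odds ratios
or_ci  <- exp(confint.default(fit, level = conf_level))   # exponentiated Wald limits
header <- sprintf("%g%% CI", 100 * conf_level)            # label for the column header

header                      # "90% CI"
cbind(OR = or_est, or_ci)   # estimates alongside their interval limits
```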

font_size

Numeric multiplier controlling the base font size for all text elements. Values > 1 increase all fonts proportionally, values < 1 decrease them. Default is 1.0. Useful for adjusting readability across different output sizes.

annot_size

Numeric value controlling the relative font size for data annotations (variable names, values in table cells). Default is 3.88. Adjust relative to font_size.

header_size

Numeric value controlling the relative font size for column headers ("Variable", "Group", "n", etc.). Default is 5.82. Headers are typically larger than annotations for hierarchy.

title_size

Numeric value controlling the relative font size for the main plot title. Default is 23.28. The title is typically the largest text element.

plot_width

Numeric value specifying the intended output width in specified units. Used for optimizing layout and text sizing. Default is NULL (automatic). Recommended: 10-16 inches for standard publications.

plot_height

Numeric value specifying the intended output height in specified units. Default is NULL (automatic based on number of rows). The function provides recommendations if not specified.

table_width

Numeric value between 0 and 1 specifying the proportion of total plot width allocated to the data table (left side). The forest plot occupies 1 - table_width. Default is 0.6 (60% table, 40% forest). Increase for longer variable names, decrease to emphasize the forest plot.

show_n

Logical. If TRUE, includes a column showing group-specific sample sizes for categorical variables and total sample size for continuous variables. Default is TRUE.

show_events

Logical. If TRUE, includes a column showing the number of events for each group. Relevant for logistic regression (number of cases) and other binary outcomes. Default is TRUE.

indent_groups

Logical. If TRUE, indents factor levels under their parent variable name, creating a hierarchical visual structure. When TRUE, the "Group" column is hidden. Default is FALSE.

condense_table

Logical. If TRUE, condenses binary categorical variables into single rows by showing only the non-reference category. Automatically sets indent_groups = TRUE. Useful for tables with many binary variables. Default is FALSE.

bold_variables

Logical. If TRUE, variable names are displayed in bold. If FALSE (default), variable names are displayed in plain text.

center_padding

Numeric value specifying the horizontal spacing (in character units) between the data table and forest plot. Increase for more separation, decrease to fit more content. Default is 4.

zebra_stripes

Logical. If TRUE, applies alternating gray background shading to different variables (not rows) to improve visual grouping and readability. Default is TRUE.

ref_label

Character string to display for reference categories of factor variables. Typically shown in place of effect estimates. Default is "reference". Common alternatives: "ref", "1.00 (ref)".

labels

Named character vector or list providing custom display labels for variables. Names should match variable names in the model, values are the labels to display. Example: c(age = "Age (years)", bmi = "Body Mass Index"). Default is NULL (use original variable names).

color

Character string specifying the color for effect estimate point markers in the forest plot. Use hex codes or R color names. Default is NULL, which auto-selects based on effect type: "#4BA6B6" (teal) for odds ratios (binomial/quasibinomial with logit link), "#3F87EE" (blue) for rate/risk ratios (Poisson, Gamma, inverse Gaussian with log link), and "#5A8F5A" (green) for coefficients (Gaussian/identity link). This scheme matches uniforest() and multiforest(). Choose colors that contrast well with black error bars.

exponentiate

Logical. If TRUE, exponentiates coefficients to display odds ratios, risk ratios, etc. If FALSE, shows raw coefficients. Default is NULL, which automatically exponentiates for logit, log, and cloglog links, and shows raw coefficients for identity link.

qc_footer

Logical. If TRUE, displays model quality control statistics in the footer (observations analyzed, model family, deviance, pseudo-R\(^2\), AIC). Default is TRUE.

units

Character string specifying the units for plot dimensions. Options: "in" (inches), "cm" (centimeters), "mm" (millimeters). Default is "in". Affects interpretation of plot_width and plot_height.

number_format

Character string or two-element character vector controlling thousand and decimal separators in formatted output. Named presets:

"us" - Comma thousands, period decimal: 1,234.56 [default]
"eu" - Period thousands, comma decimal: 1.234,56
"space" - Thin-space thousands, period decimal: 1 234.56 (SI/ISO 31-0)
"none" - No thousands separator: 1234.56

Or provide a custom two-element vector c(big.mark, decimal.mark), e.g., c("'", ".") for Swiss-style: 1'234.56.

When NULL (default), uses getOption("summata.number_format", "us"). Set the global option once per session to avoid passing this argument repeatedly:


    options(summata.number_format = "eu")
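The separator conventions behind the presets can be previewed with base R's formatC(). This is only an illustration of the big.mark/decimal.mark pairs; the helper fmt is hypothetical and is not the package's formatter.

```r
# Hypothetical helper: marks is c(big.mark, decimal.mark), as in the custom form above.
fmt <- function(x, marks) {
  formatC(x, format = "f", digits = 2,
          big.mark = marks[1], decimal.mark = marks[2])
}

x <- 1234.56
fmt(x, c(",", "."))   # "1,234.56"  (matches the "us" preset)
fmt(x, c(".", ","))   # "1.234,56"  (matches the "eu" preset)
fmt(x, c("'", "."))   # "1'234.56"  (Swiss-style custom vector)
fmt(x, c("", "."))    # "1234.56"   (matches the "none" preset)
```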

Value

A ggplot object containing the complete forest plot. The plot can be:

Displayed directly: print(plot)
Saved to file: ggsave("forest.pdf", plot, width = 12, height = 8)
Further customized with ggplot2 functions

The returned object includes an attribute "rec_dims" accessible via attr(plot, "rec_dims"), which is a list containing:

width: Numeric. Recommended plot width in specified units
height: Numeric. Recommended plot height in specified units

These recommendations are calculated from the number of variables, text sizes, and layout parameters, and are printed to the console if either plot_width or plot_height is not specified.

Details

Plot Components:

The forest plot consists of several integrated components:

Title: Centered at top, describes the analysis
Data Table (left side): Contains columns for:
- Variable: Predictor names (or custom labels)
- Group: Factor levels (optional, hidden when indenting)
- n: Sample sizes by group (optional)
- Events: Event counts by group (optional)
- Effect (95% CI); p-value: Formatted estimates with p-values
Forest Plot (right side): Graphical display with:
- Point estimates (squares sized by sample size)
- 95% confidence intervals (error bars)
- Reference line (at OR/RR = 1 or coefficient = 0)
- Log scale for odds/risk ratios
- Labeled axis
Model Statistics (footer): Summary of:
- Observations analyzed (with percentage of total data)
- Model family (Binomial, Poisson, etc.)
- Deviance statistics
- Pseudo-R\(^2\) (McFadden)
- AIC

Automatic Effect Measure Selection:

When effect_label = NULL and exponentiate = NULL, the function selects the effect measure from the model family and link function:

Logistic regression (family = binomial(link = "logit")): Odds Ratios (OR)
Log-link models (link = "log"): Risk Ratios (RR) or Rate Ratios
Other exponential families: exp(coefficient)
Identity link: Raw coefficients
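The selection rules listed above can be sketched with base R's family() accessor. The helper pick_effect_label is hypothetical and only mirrors the documented rules; it is not the package's implementation.

```r
# Hypothetical helper mirroring the documented rules; not the package's code.
pick_effect_label <- function(model) {
  fam  <- family(model)$family
  link <- family(model)$link
  if (fam %in% c("binomial", "quasibinomial") && link == "logit") {
    "Odds Ratio"
  } else if (link == "log") {
    "Risk Ratio"
  } else if (link == "identity") {
    "Coefficient"
  } else {
    "Exp(Coefficient)"
  }
}

# Three small models on the built-in mtcars data
logit_fit   <- glm(am ~ wt,   data = mtcars, family = binomial)
poisson_fit <- glm(carb ~ wt, data = mtcars, family = poisson)
gauss_fit   <- glm(mpg ~ wt,  data = mtcars, family = gaussian)

pick_effect_label(logit_fit)     # "Odds Ratio"
pick_effect_label(poisson_fit)   # "Risk Ratio"
pick_effect_label(gauss_fit)     # "Coefficient"
```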

Reference Categories:

For factor variables, the first level (determined by factor ordering or alphabetically for character variables) serves as the reference category:

Displayed with the ref_label instead of an estimate
No confidence interval or p-value shown
Visually aligned with other categories
When condense_table = TRUE, reference-only variables may be omitted entirely
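Because the reference category is simply the first factor level, it can be changed before fitting the model with base R's relevel(). The toy vector below is made up for illustration.

```r
# Toy data for illustration only
treatment <- factor(c("placebo", "drug_a", "drug_b", "placebo"))
levels(treatment)[1]   # "drug_a": levels are sorted alphabetically by default

# Make "placebo" the reference before fitting the model
treatment <- relevel(treatment, ref = "placebo")
levels(treatment)      # "placebo" "drug_a" "drug_b"
```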

Layout Optimization:

The function automatically optimizes layout based on content:

Calculates appropriate axis ranges to accommodate all confidence intervals
Selects meaningful tick marks on log or linear scales
Sizes point markers proportional to sample size (larger = more data)
Adjusts table width based on variable name lengths when table_width = NULL
Recommends overall dimensions based on number of rows

Visual Grouping Options:

Three display modes are available:

Standard (indent_groups = FALSE, condense_table = FALSE): Separate "Variable" and "Group" columns, all categories shown
Indented (indent_groups = TRUE, condense_table = FALSE): Hierarchical display with groups indented under variables
Condensed (condense_table = TRUE): Binary variables shown in single rows, automatically indented

Zebra Striping:

When zebra_stripes = TRUE, alternating variables (not individual rows) receive light gray backgrounds. This visually groups all levels of a factor variable together, making the plot easier to read, especially with many multi-level factors.
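The difference between striping by variable and striping by row can be sketched in base R. The row layout and the choice of which variable is shaded first are assumptions made for the example, not the package's internals.

```r
# Toy table: one row per displayed level, grouped by variable
rows <- data.frame(
  variable = c("age", "sex", "sex", "treatment", "treatment", "treatment")
)

# Index each variable in order of first appearance, then shade every second variable
var_index   <- match(rows$variable, unique(rows$variable))
rows$shaded <- var_index %% 2 == 0

rows
# Both "sex" rows share one stripe; all three "treatment" rows share the next
```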

Model Statistics Display:

The footer shows key diagnostic information:

Observations analyzed: Total N and percentage of original data (accounting for missing values)
Null/Residual Deviance: Model fit improvement
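Pseudo-R\(^2\): McFadden R\(^2\) = 1 - (log L_model / log L_null), where L_model and L_null are the likelihoods of the fitted model and the intercept-only model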
AIC: For model comparison (lower is better)

For logistic regression, concordance (C-statistic/AUC) may also be displayed if available.
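McFadden's pseudo-R\(^2\) compares the log-likelihood of the fitted model with that of an intercept-only model. The base R sketch below, using the built-in mtcars data, shows one way to compute it by hand; it is not necessarily how the footer value is calculated internally.

```r
fit      <- glm(am ~ wt, data = mtcars, family = binomial)
null_fit <- update(fit, . ~ 1)   # intercept-only (null) model

# McFadden: 1 - logLik(fitted) / logLik(null)
mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null_fit))
mcfadden

# For ungrouped 0/1 outcomes the deviance-based form gives the same value
all.equal(mcfadden, 1 - fit$deviance / fit$null.deviance)
```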

Saving Plots:

Use ggplot2::ggsave() with recommended dimensions:


  p <- glmforest(model, data)
  dims <- attr(p, "rec_dims")
  ggplot2::ggsave("forest.pdf", p, width = dims$width, height = dims$height)

Or specify custom dimensions:


  ggplot2::ggsave("forest.png", p, width = 12, height = 8, dpi = 300)

Examples

data(clintrial)
data(clintrial_labels)

# Create example model
model1 <- glm(os_status ~ age + sex + bmi + treatment,
              data = clintrial, family = binomial)

# Example 1: Basic logistic regression forest plot
p <- glmforest(model1, data = clintrial)
#> Recommended plot dimensions: width = 13.5 in, height = 5.0 in

# \donttest{

old_width <- options(width = 180)

# Example 2: With custom variable labels
plot2 <- glmforest(
    x = model1,
    data = clintrial,
    title = "Risk Factors for Mortality",
    labels = clintrial_labels
)
#> Recommended plot dimensions: width = 15.3 in, height = 5.0 in

# Example 3: Indented layout with formatting options
plot3 <- glmforest(
    x = model1,
    data = clintrial,
    indent_groups = TRUE,
    zebra_stripes = TRUE,
    color = "#D62728",
    labels = clintrial_labels
)
#> Recommended plot dimensions: width = 13.4 in, height = 5.2 in

# Example 4: Condensed layout for many binary variables
model4 <- glm(os_status ~ age + sex + smoking + hypertension + 
                  diabetes + surgery,
              data = clintrial,
              family = binomial)

plot4 <- glmforest(
    x = model4,
    data = clintrial,
    condense_table = TRUE,
    labels = clintrial_labels
)
#> Recommended plot dimensions: width = 12.7 in, height = 5.2 in
# Binary variables shown in single rows

# Example 5: Poisson regression for count data
model5 <- glm(ae_count ~ age + treatment + diabetes + surgery,
              data = clintrial,
              family = poisson)

plot5 <- glmforest(
    x = model5,
    data = clintrial,
    title = "Rate Ratios for Adverse Events",
    labels = clintrial_labels
)
#> Recommended plot dimensions: width = 14.9 in, height = 5.0 in

# Example 6: Save with recommended dimensions
dims <- attr(plot5, "rec_dims")
ggplot2::ggsave(file.path(tempdir(), "forest.pdf"),
                plot5, width = dims$width, height = dims$height)

options(old_width)

# }