Table One and Demographic Reporting with Sardine

This vignette demonstrates how to generate publication-ready descriptive statistics tables (commonly called “Table 1”) and demographic summaries from REDCap data using the sardine package.

What is Table One?

Table One is typically the first table in a research manuscript. It presents:

- Baseline characteristics of study participants
- Summary statistics (n, %, mean, SD, median, IQR)
- Comparisons between groups (optional)
- Statistical test results (optional)

The generate_table_one() function automates this process with flexible options for variable selection, stratification, statistical testing, and output formatting.

Setup

library(sardine)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(knitr)

Project Setup

# Load environment and create project
load_env()
project <- redcap_project()

# View project structure
project$info()

Basic Table One

All Variables (Default)

The simplest approach includes all non-ID fields:

# Generate table with all variables
table_one <- generate_table_one(project)
print(table_one)

Example output:

Table 1: Baseline Characteristics
Total N: 150

Variable                     Type           Overall
--------------------------------------------------------
N                                           150
Age (years)                  Mean (SD)      42.3 (12.5)
  Missing                    n (%)          3 (2.0%)
Gender                                      n = 150
  Male                       n (%)          85 (56.7%)
  Female                     n (%)          65 (43.3%)
BMI (kg/m²)                  Mean (SD)      26.4 (4.8)
  Missing                    n (%)          8 (5.3%)
Systolic BP (mmHg)           Mean (SD)      128.5 (15.2)
Smoking Status                              n = 150
  Never                      n (%)          90 (60.0%)
  Former                     n (%)          35 (23.3%)
  Current                    n (%)          25 (16.7%)

Note: The table includes an N row at the top and a Type column indicating the statistic type (Mean (SD), Median [IQR], or n (%)).
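To make the Type column concrete, here is a minimal base R sketch of how a "Mean (SD)" cell and "n (%)" cells can be computed. The helper names fmt_mean_sd() and fmt_n_pct() and the small data frame are hypothetical and written for this illustration; they are not part of sardine, and the package's internal formatting may differ.

```r
# Simulated data for illustration only (not from a real REDCap project)
demo <- data.frame(
  age    = c(34, 51, 46, NA, 29, 62),
  gender = c("Male", "Female", "Male", "Male", "Female", "Female")
)

# Continuous variable: "Mean (SD)" computed on the non-missing values
fmt_mean_sd <- function(x, digits = 1) {
  f <- function(v) formatC(v, format = "f", digits = digits)
  paste0(f(mean(x, na.rm = TRUE)), " (", f(sd(x, na.rm = TRUE)), ")")
}

# Categorical variable: "n (%)" for each observed level
fmt_n_pct <- function(x, digits = 1) {
  counts <- table(x)
  n <- as.integer(counts)
  pct <- formatC(100 * n / sum(n), format = "f", digits = digits)
  setNames(paste0(n, " (", pct, "%)"), names(counts))
}

fmt_mean_sd(demo$age)     # "44.4 (13.2)"
fmt_n_pct(demo$gender)    # Female "3 (50.0%)", Male "3 (50.0%)"
```

Recomputing one or two cells this way is also a quick check that a generated table matches the raw data.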

Selected Variables

Choose specific variables to include:

# Select specific variables for the table
table_one <- generate_table_one(
  project,
  vars = c("age", "gender", "bmi", "systolic_bp", "smoking_status", "education")
)
print(table_one)

Filtering Data

Basic Filtering

Apply filters to include only relevant participants:

# Exclude withdrawn participants
table_one <- generate_table_one(
  project,
  filter = "withdrawn != 1"
)
print(table_one)

Multiple Filters

Combine multiple conditions (all must be TRUE):

# Multiple inclusion criteria
table_one <- generate_table_one(
  project,
  vars = c("age", "gender", "bmi", "baseline_score"),
  filter = c(
    "withdrawn != 1",           # Not withdrawn
    "consent_complete == 2",    # Consent form marked Complete (REDCap status 2)
    "age >= 18",                # Adult participants
    "baseline_complete == 2"    # Baseline assessment complete
  )
)
print(table_one)

The function will report how many records were retained after filtering:

ℹ Filtered from 200 to 150 records (75.0% retained)
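The sketch below shows one plausible way a vector of filter strings could be combined so that a record is kept only when every condition is TRUE. The helper apply_filters() and the records data frame are hypothetical; sardine's own implementation is not shown in this vignette and may work differently. The sketch also assumes that a condition evaluating to NA (for example, a blank withdrawn field) drops the record, which is how dplyr::filter() behaves.

```r
# Simulated records for illustration only
records <- data.frame(
  record_id = 1:6,
  withdrawn = c(0, 1, 0, NA, 0, 0),
  age       = c(25, 40, 17, 52, 33, 61)
)

# Hypothetical helper: keep rows where every condition is TRUE
apply_filters <- function(data, conditions) {
  keep <- rep(TRUE, nrow(data))
  for (cond in conditions) {
    # Evaluate the condition string with the data frame columns in scope
    result <- eval(parse(text = cond), envir = data)
    # A comparison involving NA yields NA; treat it as "not TRUE"
    keep <- keep & !is.na(result) & result
  }
  kept <- data[keep, , drop = FALSE]
  message(sprintf("Filtered from %d to %d records (%.1f%% retained)",
                  nrow(data), nrow(kept), 100 * nrow(kept) / nrow(data)))
  kept
}

adults <- apply_filters(records, c("withdrawn != 1", "age >= 18"))
adults$record_id   # 1 5 6
```

Record 4 is excluded even though it was never marked as withdrawn, because its withdrawn value is missing. If such records should be retained, write the condition explicitly, for example "is.na(withdrawn) | withdrawn != 1".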

Stratified Tables (Group Comparisons)

Simple Stratification

Compare characteristics between groups:

# Compare treatment groups
table_one <- generate_table_one(
  project,
  vars = c("age", "gender", "bmi", "baseline_score"),
  strata = "treatment_group"
)
print(table_one)

Example output:

Table 1: Baseline Characteristics by Treatment Group

Variable                     Type           Control        Treatment
----------------------------------------------------------------------
N                                           75             75
Age (years)                  Mean (SD)      42.1 (12.8)    42.5 (12.2)
Gender                                      n = 75         n = 75
  Male                       n (%)          42 (56.0%)     43 (57.3%)
  Female                     n (%)          33 (44.0%)     32 (42.7%)
BMI (kg/m²)                  Mean (SD)      26.2 (4.7)     26.6 (4.9)
Baseline Score               Mean (SD)      65.3 (8.2)     64.8 (8.5)

With Statistical Tests

Add p-values to assess group differences:

# Include statistical tests
table_one <- generate_table_one(
  project,
  vars = c("age", "gender", "bmi", "smoking_status", "baseline_score"),
  strata = "treatment_group",
  test = TRUE,
  test_type = "auto"  # Automatic test selection
)
print(table_one)

Output includes p-values:

Variable                     Type          Control        Treatment      P-value
-----------------------------------------------------------------------------------
N                                          75             75             
Age (years)                  Mean (SD)     42.1 (12.8)    42.5 (12.2)    0.843
Gender                                     n = 75         n = 75         0.867
  Male                       n (%)         42 (56.0%)     43 (57.3%)
  Female                     n (%)         33 (44.0%)     32 (42.7%)
BMI (kg/m²)                  Mean (SD)     26.2 (4.7)     26.6 (4.9)     0.592
Smoking Status                             n = 75         n = 75         0.234
  Never                      n (%)         48 (64.0%)     42 (56.0%)
  Former                     n (%)         16 (21.3%)     19 (25.3%)
  Current                    n (%)         11 (14.7%)     14 (18.7%)
Baseline Score               Mean (SD)     65.3 (8.2)     64.8 (8.5)     0.691

Statistical Tests Used:

- Categorical variables: Chi-square test (or Fisher’s exact test if expected cell counts are < 5)
- Continuous variables (2 groups): t-test (parametric) or Wilcoxon rank-sum test (nonparametric)
- Continuous variables (3+ groups): ANOVA (parametric) or Kruskal-Wallis test (nonparametric)
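The selection rules listed above can be sketched with functions from base R's stats package. The helper choose_test() is hypothetical and only illustrates the decision logic; it is not sardine's internal code. In particular, it assumes that a single expected cell count below 5 is enough to switch to Fisher's exact test, and it uses R's default Welch t-test.

```r
# Hypothetical helper that follows the rules listed above.
# It is an illustration, not sardine's internal code.
choose_test <- function(x, group, categorical, parametric = TRUE) {
  group <- factor(group)
  if (categorical) {
    tab <- table(x, group)
    # Expected counts under independence; the small-count warning is
    # suppressed because that case is handled by switching tests
    expected <- suppressWarnings(chisq.test(tab)$expected)
    if (any(expected < 5)) {   # assumption: any cell below 5 triggers Fisher
      return(list(test = "Fisher's exact", p = fisher.test(tab)$p.value))
    }
    return(list(test = "Chi-square", p = chisq.test(tab)$p.value))
  }
  two_groups <- nlevels(group) == 2
  if (parametric && two_groups) {
    list(test = "t-test", p = t.test(x ~ group)$p.value)   # Welch by default
  } else if (parametric) {
    list(test = "ANOVA", p = summary(aov(x ~ group))[[1]][["Pr(>F)"]][1])
  } else if (two_groups) {
    list(test = "Wilcoxon rank-sum", p = wilcox.test(x ~ group)$p.value)
  } else {
    list(test = "Kruskal-Wallis", p = kruskal.test(x ~ group)$p.value)
  }
}

# Simulated data for illustration only
set.seed(42)
arm    <- rep(c("Control", "Treatment"), each = 30)
site   <- rep(c("A", "B", "C"), times = 20)
age    <- rnorm(60, mean = 42, sd = 12)
smoker <- rep(c("Never", "Former", "Current"), times = 20)

choose_test(smoker, arm, categorical = TRUE)$test                     # "Chi-square"
choose_test(age, arm, categorical = FALSE)$test                       # "t-test"
choose_test(age, arm, categorical = FALSE, parametric = FALSE)$test   # "Wilcoxon rank-sum"
choose_test(age, site, categorical = FALSE)$test                      # "ANOVA"

# A sparse 2 x 2 table (expected count of 2 per cell) falls back to Fisher
small_x   <- c("Yes", "No", "Yes", "No", "Yes", "Yes", "No", "No")
small_arm <- rep(c("A", "B"), each = 4)
choose_test(small_x, small_arm, categorical = TRUE)$test              # "Fisher's exact"
```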

Multiple Stratification Variables

Stratify by multiple factors:

# Stratify by treatment group and study site
table_one <- generate_table_one(
  project,
  vars = c("age", "gender", "bmi", "baseline_score"),
  strata = c("treatment_group", "study_site")
)
print(table_one)
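How sardine lays out the columns when several stratification variables are given is not shown in this vignette. One common approach in base R, offered here as an assumption rather than a description of the package, is to cross the variables into a single grouping factor with interaction(), so that each combination becomes one column. The data frame below is simulated.

```r
# Simulated allocation for illustration only
arms <- data.frame(
  treatment_group = rep(c("Control", "Treatment"), times = 4),
  study_site      = rep(c("Site A", "Site B"), each = 4)
)

# Cross the two variables into one factor with a level per combination
combined <- interaction(arms$treatment_group, arms$study_site, sep = " / ")

levels(combined)
table(combined)   # participants per combined stratum
```

With two arms and two sites this yields four strata. Cell sizes shrink quickly as more stratification variables are added, so check the per-stratum N before interpreting any summary.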

Variable Type Control

Automatic Detection

By default, variable types are detected from the following sources, in priority order:

1. User-specified cat_vars and cont_vars (highest priority)
2. REDCap metadata field types
3. Data inspection (variables with <10 unique values are treated as categorical)
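The priority order described above can be sketched as a small decision function. The helper detect_type() is hypothetical and is not sardine's implementation. The field type names (radio, dropdown, yesno, truefalse, checkbox, calc, slider, text) are REDCap data dictionary types, but which of them the package maps to categorical or continuous is an assumption made for this example.

```r
# Hypothetical helper illustrating the three-step priority order
detect_type <- function(name, values, field_type = NA,
                        cat_vars = character(), cont_vars = character()) {
  # 1. Explicit user overrides take precedence
  if (name %in% cat_vars)  return("categorical")
  if (name %in% cont_vars) return("continuous")

  # 2. REDCap metadata field type (mapping assumed for this sketch)
  if (!is.na(field_type)) {
    if (field_type %in% c("radio", "dropdown", "yesno", "truefalse", "checkbox")) {
      return("categorical")
    }
    if (field_type %in% c("calc", "slider")) {
      return("continuous")
    }
  }

  # 3. Data inspection: non-numeric data or fewer than 10 distinct values
  n_unique <- length(unique(values[!is.na(values)]))
  if (!is.numeric(values) || n_unique < 10) "categorical" else "continuous"
}

# A user override beats data inspection (only 3 distinct values here)
detect_type("age", c(30, 40, 50), cont_vars = "age")            # "continuous"
# Metadata decides when there is no override
detect_type("gender", c(1, 2, 1), field_type = "radio")         # "categorical"
# Free-text numeric fields fall through to data inspection
detect_type("bmi", seq(18, 40, by = 0.5), field_type = "text")  # "continuous"
detect_type("visits", c(1, 2, 3, 2, 1), field_type = "text")    # "categorical"
```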

Force Categorical Variables

Sometimes numeric variables should be treated as categorical:

# Force certain variables to be categorical
table_one <- generate_table_one(
  project,
  vars = c("age", "gender", "education_level", "income_bracket", "bmi"),
  cat_vars = c("education_level", "income_bracket"),  # Force categorical
  cont_vars = c("age", "bmi")  # Force continuous
)

Common cases for forcing categorical:

- Ordinal scales (Likert scales, education levels)
- Income brackets
- Numeric codes representing categories
- Count variables you want to display as categories

Force Continuous Variables

Override incorrect REDCap metadata:

# REDCap sometimes misclassifies age as text/categorical
table_one <- generate_table_one(
  project,
  vars = c("age", "weight", "height", "bmi", "gender"),
  cont_vars = c("age", "weight", "height", "bmi"),
  cat_vars = c("gender")
)

Non-Normal Distributions

Median and IQR

For skewed continuous variables, report median [IQR] instead of mean (SD):

# Specify non-normal variables
table_one <- generate_table_one(
  project,
  vars = c("age", "bmi", "cholesterol", "triglycerides", "income"),
  cont_vars = c("age", "bmi", "cholesterol", "triglycerides", "income"),
  non_normal = c("cholesterol", "triglycerides", "income"),  # Use median [IQR]
  strata = "treatment_group",
  test = TRUE,
  test_type = "nonparametric"  # Use nonparametric tests
)

Output:

Variable                     Type            Control                Treatment              P-value
--------------------------------------------------------------------------------------------------
N                                            75                     75
Age (years)                  Mean (SD)       42.1 (12.8)            42.5 (12.2)            0.843
BMI (kg/m²)                  Mean (SD)       26.2 (4.7)             26.6 (4.9)             0.592
Cholesterol (mg/dL)          Median [IQR]    185 [165-210]          188 [170-215]          0.456
Triglycerides (mg/dL)        Median [IQR]    135 [98-180]           142 [105-185]          0.523
Income ($)                   Median [IQR]    55000 [42000-75000]    58000 [45000-78000]    0.321

Note how the Type column clearly indicates which variables use Mean (SD) and which use Median [IQR].
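As a rough illustration of why median [IQR] suits skewed variables, the base R sketch below formats a "Median [IQR]" cell and compares the median with the mean. The helper fmt_median_iqr() and the income values are hypothetical. The sketch uses R's default quantile method (type 7), which is an assumption; the quartile definition used by sardine is not stated here.

```r
# Hypothetical helper: "median [Q1-Q3]" using R's default quantile method
fmt_median_iqr <- function(x, digits = 0) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  f <- function(v) formatC(v, format = "f", digits = digits)
  paste0(f(q[2]), " [", f(q[1]), "-", f(q[3]), "]")
}

# Simulated right-skewed incomes: one large value pulls the mean upward
income <- c(30000, 42000, 48000, 55000, 61000, 75000, 250000)

fmt_median_iqr(income)   # "55000 [45000-68000]"
mean(income)             # about 80143, well above the median
```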

Test Type Selection

Control which statistical tests are used:

# Parametric tests (assumes normality)
table_parametric <- generate_table_one(
  project,
  vars = c("age", "bmi", "baseline_score"),
  strata = "treatment_group",
  test = TRUE,
  test_type = "parametric"  # t-test or ANOVA
)

# Nonparametric tests (no normality assumption)
table_nonparametric <- generate_table_one(
  project,
  vars = c("age", "bmi", "baseline_score"),
  strata = "treatment_group",
  test = TRUE,
  test_type = "nonparametric"  # Wilcoxon or Kruskal-Wallis
)

# Automatic selection (recommended)
table_auto <- generate_table_one(
  project,
  vars = c("age", "bmi", "baseline_score"),
  strata = "treatment_group",
  test = TRUE,
  test_type = "auto"  # Chooses based on data distribution
)

Missing Data Handling

Include Missing Counts

Show how much data is missing:

table_one <- generate_table_one(
  project,
  vars = c("age", "gender", "bmi", "smoking_status"),
  include_missing = TRUE  # Default
)

Output shows missing data:

Variable                     Overall (N=150)
--------------------------------------------
Age (years)                  42.3 (12.5)
  Missing                    3 (2.0%)
Gender, n (%)
  Male                       85 (56.7%)
  Female                     65 (43.3%)
  Missing                    0 (0.0%)
BMI (kg/m²)                  26.4 (4.8)
  Missing                    8 (5.3%)
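A Missing row of this kind can be reproduced by hand, which is useful when spot-checking a table. In the example output above the percentage is relative to the total N (3 of 150 is 2.0%), and the sketch follows that convention. The helper fmt_missing() is hypothetical, and treating empty strings as missing is an assumption about how blank text fields should be counted.

```r
# Hypothetical helper: "n (%)" of missing values relative to all records
fmt_missing <- function(x, digits = 1) {
  # Count NA values and, for character data, empty strings as missing
  is_missing <- is.na(x) | (is.character(x) & x == "")
  n_missing <- sum(is_missing)
  pct <- formatC(100 * n_missing / length(x), format = "f", digits = digits)
  paste0(n_missing, " (", pct, "%)")
}

# Simulated values for illustration only
bmi    <- c(22.5, NA, 27.1, 30.4, NA, 24.8, 26.0, 28.3, 23.9, 25.5)
gender <- c("Male", "", "Female", "Female", NA, "Male", "Male", "Female", "Male", "Female")

fmt_missing(bmi)      # "2 (20.0%)"
fmt_missing(gender)   # "2 (20.0%)"
```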

Exclude Missing Counts

For cleaner tables when missing data is minimal:

table_one <- generate_table_one(
  project,
  vars = c("age", "gender", "bmi"),
  include_missing = FALSE
)

Output Formatting

Default Data Frame

# Returns a data frame
table_df <- generate_table_one(
  project,
  output_format = "data.frame"  # Default
)

# Can manipulate as needed, e.g. drop the "Missing" rows
# (trimws() guards against indented row labels)
table_df %>% filter(trimws(Variable) != "Missing")

Kable (R Markdown)

For R Markdown documents:

# Generate kable table
table_kable <- generate_table_one(
  project,
  vars = c("age", "gender", "bmi", "treatment_group"),
  strata = "treatment_group",
  output_format = "kable"
)

# Renders nicely in R Markdown
table_kable

GT Package

For advanced table formatting:

# Requires gt package
library(gt)

table_gt <- generate_table_one(
  project,
  vars = c("age", "gender", "bmi", "baseline_score"),
  strata = "treatment_group",
  test = TRUE,
  output_format = "gt"
)

# Can further customize with gt functions
table_gt %>%
  tab_header(
    title = "Baseline Characteristics",
    subtitle = "Randomized Clinical Trial"
  ) %>%
  tab_source_note("Data as of 2025-10-21")

Flextable (Word Export)

For Microsoft Word documents:

# Requires flextable package
library(flextable)

table_flex <- generate_table_one(
  project,
  vars = c("age", "gender", "bmi", "baseline_score"),
  strata = "treatment_group",
  test = TRUE,
  output_format = "flextable"
)

# Export to Word
save_as_docx(table_flex, path = "table_one.docx")

Common Use Cases

Clinical Trial Baseline Characteristics

# Standard baseline table for clinical trial
baseline_table <- generate_table_one(
  project,
  vars = c(
    "age", "gender", "race", "ethnicity",
    "height", "weight", "bmi",
    "systolic_bp", "diastolic_bp",
    "medical_history_diabetes", "medical_history_hypertension",
    "baseline_pain_score", "baseline_function_score"
  ),
  filter = c(
    "consent_complete == 2",
    "eligibility_complete == 2",
    "baseline_complete == 2"
  ),
  strata = "treatment_arm",
  cat_vars = c("gender", "race", "ethnicity", 
               "medical_history_diabetes", "medical_history_hypertension"),
  cont_vars = c("age", "height", "weight", "bmi", 
                "systolic_bp", "diastolic_bp",
                "baseline_pain_score", "baseline_function_score"),
  test = TRUE,
  test_type = "auto",
  digits = 1,
  output_format = "gt"
)

Cohort Study Demographics

# Descriptive table for cohort study
cohort_demographics <- generate_table_one(
  project,
  vars = c(
    "age", "gender", "education", "employment_status",
    "marital_status", "household_income",
    "smoking_status", "alcohol_use", "physical_activity",
    "bmi", "comorbidity_count"
  ),
  filter = "enrolled == 1",
  cat_vars = c("gender", "education", "employment_status", 
               "marital_status", "smoking_status", "alcohol_use"),
  cont_vars = c("age", "bmi", "physical_activity", "household_income", "comorbidity_count"),
  non_normal = c("household_income", "comorbidity_count"),
  include_missing = TRUE,
  digits = 1
)

Case-Control Comparison

# Compare cases vs controls
case_control_table <- generate_table_one(
  project,
  vars = c(
    "age", "gender", "bmi",
    "smoking_history", "family_history",
    "biomarker_a", "biomarker_b", "biomarker_c"
  ),
  filter = "study_complete == 2",
  strata = "case_status",  # Case vs Control
  cat_vars = c("gender", "smoking_history", "family_history"),
  cont_vars = c("age", "bmi", "biomarker_a", "biomarker_b", "biomarker_c"),
  non_normal = c("biomarker_a", "biomarker_c"),  # Skewed biomarkers
  test = TRUE,
  test_type = "auto",
  output_format = "flextable"
)

Multi-Site Study

# Compare across study sites
site_comparison <- generate_table_one(
  project,
  vars = c("age", "gender", "race", "bmi", "baseline_score"),
  filter = "enrolled == 1",
  strata = "study_site",  # Multiple sites
  cat_vars = c("gender", "race"),
  cont_vars = c("age", "bmi", "baseline_score"),
  test = TRUE,
  digits = 1
)

Subgroup Analysis

# Table for specific subgroup
elderly_table <- generate_table_one(
  project,
  vars = c(
    "age", "gender", "frailty_score", "cognitive_score",
    "falls_past_year", "medications_count"
  ),
  filter = c(
    "age >= 65",
    "baseline_complete == 2"
  ),
  strata = "intervention_group",
  cont_vars = c("age", "frailty_score", "cognitive_score", "medications_count"),
  cat_vars = c("gender", "falls_past_year"),
  non_normal = "medications_count",
  test = TRUE,
  output_format = "gt"
)

Combining with Data Quality Reports

Integrate Table One with data quality checking:

# Check data quality first
quality_report <- analyze_missing_data(project)
print(quality_report)

# Generate Table One after confirming data quality
table_one <- generate_table_one(
  project,
  vars = quality_report %>% 
    filter(missing_percent < 10) %>%  # Only variables with <10% missing
    pull(field),
  filter = "data_quality_passed == 1",
  strata = "study_group"
)

Exporting Results

Save Multiple Formats

# Generate table once
table_one <- generate_table_one(
  project,
  vars = c("age", "gender", "bmi", "treatment_group"),
  strata = "treatment_group",
  test = TRUE
)

# Export to CSV
write.csv(table_one, "table_one.csv", row.names = FALSE)

# Export to formatted table for Word
library(flextable)
table_flex <- generate_table_one(
  project,
  vars = c("age", "gender", "bmi", "treatment_group"),
  strata = "treatment_group",
  test = TRUE,
  output_format = "flextable"
)
save_as_docx(table_flex, path = "table_one.docx")

# Export for LaTeX
library(xtable)
print(xtable(table_one), file = "table_one.tex")

Advanced Tips

Custom Decimal Places

# High precision for lab values
lab_table <- generate_table_one(
  project,
  vars = c("glucose", "hba1c", "cholesterol"),
  digits = 2  # Two decimal places
)
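When checking values by hand, note that rounding a number and formatting it to a fixed number of decimals are not the same thing. The base R comparison below uses an illustrative value; whether sardine pads trailing zeros in its output is not confirmed here.

```r
hba1c_mean <- 6.5   # illustrative value, not real data

# round() changes the number but drops trailing zeros when printed
as.character(round(hba1c_mean, 2))              # "6.5"

# formatC() with format = "f" always shows the requested decimals
formatC(hba1c_mean, format = "f", digits = 2)   # "6.50"
```

Fixed-width formatting keeps every cell in a column at the same precision, which most journals expect.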

Reproducible Reports

# Document table generation settings
table_settings <- list(
  date_generated = Sys.Date(),
  filter_criteria = c("withdrawn != 1", "consent_complete == 2"),
  stratification = "treatment_group",
  test_type = "auto",
  non_normal_vars = c("triglycerides", "income")
)

# Generate table using the documented settings
table_one <- generate_table_one(
  project,
  vars = c("age", "gender", "bmi", "triglycerides", "income"),
  filter = table_settings$filter_criteria,
  strata = table_settings$stratification,
  non_normal = table_settings$non_normal_vars,
  test = TRUE,
  test_type = table_settings$test_type
)

# Save settings with table
saveRDS(list(table = table_one, settings = table_settings), 
        "table_one_with_metadata.rds")

Automated Reporting

# Function for standardized table generation
generate_standard_table_one <- function(project,
                                        demographic_vars = c("age", "gender", "race"),
                                        clinical_vars = c("bmi", "systolic_bp"),
                                        cat_vars = c("gender", "race"),
                                        strata_var = NULL) {

  all_vars <- c(demographic_vars, clinical_vars)

  generate_table_one(
    project,
    vars = all_vars,
    filter = c("consent_complete == 2", "withdrawn != 1"),
    strata = strata_var,
    # Only the listed variables are categorical; the rest (e.g. age) stay continuous
    cat_vars = intersect(all_vars, cat_vars),
    cont_vars = setdiff(all_vars, cat_vars),
    test = !is.null(strata_var),  # Test only when groups are compared
    include_missing = TRUE,
    digits = 1
  )
}

# Use across multiple projects
table_project1 <- generate_standard_table_one(project, strata_var = "treatment_group")

Best Practices

- Always filter appropriately: Exclude withdrawn, incomplete, or test records
- Force variable types when needed: REDCap metadata isn’t always perfect
- Report non-normal distributions correctly: Use median [IQR] for skewed data
- Include missing data information: Be transparent about data completeness
- Choose appropriate statistical tests: Consider your data distribution and assumptions
- Document your choices: Record filter criteria, variable classifications, and test selections
- Check balance in randomized trials: Under randomization, baseline differences arise by chance, so an occasional small p-value is expected; judge the size of an imbalance rather than its significance
- Verify results: Spot-check a few values against raw data

Summary

The generate_table_one() function provides:

- Automated descriptive statistics with appropriate formatting
- Flexible variable selection and filtering to focus on relevant data
- Stratification and statistical testing for group comparisons
- Override controls for variable type classification
- Multiple output formats for different publication needs
- Comprehensive handling of categorical, continuous, and non-normal data

This enables efficient creation of publication-ready baseline characteristics tables with minimal manual data manipulation, while maintaining full control over statistical approaches and presentation.