Data Preparation and Validation
2026-01-25
Source:vignettes/data-preparation.Rmd
data-preparation.RmdIntroduction
This vignette demonstrates how to prepare and validate data before running multiple imputation with rbmi. The rbmiUtils package provides three key functions for this workflow:
-
validate_data(): Pre-flight validation to catch common data issues -
prepare_data_ice(): Build intercurrent event data from flag columns -
summarise_missingness(): Understand missing data patterns
Using these functions helps ensure your imputation will run successfully and gives you insight into the structure of your missing data.
Example Data
We’ll create a small example dataset to demonstrate the functions:
set.seed(42)
dat <- data.frame(
USUBJID = factor(rep(paste0("SUBJ-", 1:20), each = 4)),
AVISIT = factor(
rep(c("Week 4", "Week 8", "Week 12", "Week 16"), 20),
levels = c("Week 4", "Week 8", "Week 12", "Week 16")
),
TRT = factor(rep(c("Placebo", "Drug A"), each = 40)),
BASE = rep(round(rnorm(20, 50, 10), 1), each = 4),
STRATA = factor(rep(sample(c("Low", "High"), 20, replace = TRUE), each = 4))
)
# Generate CHG with some missing values
dat$CHG <- round(rnorm(80, mean = -2, sd = 3), 1)
# Create missing data patterns:
# - Subjects 3, 8: monotone dropout at Week 12
# - Subject 15: intermittent missing at Week 8
# - Subject 18: monotone dropout at Week 16
dat$CHG[dat$USUBJID == "SUBJ-3" & dat$AVISIT %in% c("Week 12", "Week 16")] <- NA
dat$CHG[dat$USUBJID == "SUBJ-8" & dat$AVISIT %in% c("Week 12", "Week 16")] <- NA
dat$CHG[dat$USUBJID == "SUBJ-15" & dat$AVISIT == "Week 8"] <- NA
dat$CHG[dat$USUBJID == "SUBJ-18" & dat$AVISIT == "Week 16"] <- NA
# Add discontinuation flag
dat$DISCFL <- ifelse(
dat$USUBJID %in% c("SUBJ-3", "SUBJ-8") & dat$AVISIT == "Week 12",
"Y",
ifelse(
dat$USUBJID == "SUBJ-18" & dat$AVISIT == "Week 16",
"Y",
"N"
)
)
head(dat, 12)
#> USUBJID AVISIT TRT BASE STRATA CHG DISCFL
#> 1 SUBJ-1 Week 4 Placebo 63.7 Low -0.6 N
#> 2 SUBJ-1 Week 8 Placebo 63.7 Low 0.1 N
#> 3 SUBJ-1 Week 12 Placebo 63.7 Low 1.1 N
#> 4 SUBJ-1 Week 16 Placebo 63.7 Low -3.8 N
#> 5 SUBJ-2 Week 4 Placebo 44.4 Low -0.5 N
#> 6 SUBJ-2 Week 8 Placebo 44.4 Low -7.2 N
#> 7 SUBJ-2 Week 12 Placebo 44.4 Low -4.4 N
#> 8 SUBJ-2 Week 16 Placebo 44.4 Low -4.6 N
#> 9 SUBJ-3 Week 4 Placebo 53.6 High -9.2 N
#> 10 SUBJ-3 Week 8 Placebo 53.6 High -1.9 N
#> 11 SUBJ-3 Week 12 Placebo 53.6 High NA Y
#> 12 SUBJ-3 Week 16 Placebo 53.6 High NA NValidating Data
The validate_data() function performs comprehensive
checks on your data before imputation:
# This will pass validation
validate_data(dat, vars)The function checks:
- Data is a data.frame
- All required columns exist (subjid, visit, group, outcome, covariates)
- Factor columns are properly typed
- Outcome column is numeric
- Covariates have no missing values
- No duplicate subject-visit combinations
- If
data_iceis provided: valid subjects, visits, and strategies
Catching Validation Errors
Here’s an example of how validation catches issues:
# Create problematic data
bad_dat <- dat
bad_dat$TRT <- as.character(bad_dat$TRT) # Should be factor
bad_dat$BASE[1] <- NA # Covariate with missing value
# This will report all issues at once
tryCatch(
validate_data(bad_dat, vars),
error = function(e) cat(e$message)
)
#> Warning: Column `TRT` is character and will be converted to factor by
#> `rbmi::draws()`
#> Data validation failed:
#> - Covariate `BASE` has 1 missing value(s)Summarising Missing Data
Before imputation, it’s important to understand your missing data patterns:
miss <- summarise_missingness(dat, vars)Missing by Visit
print(miss$by_visit)
#> # A tibble: 8 × 5
#> visit group n n_miss pct_miss
#> <fct> <fct> <int> <int> <dbl>
#> 1 Week 4 Drug A 10 0 0
#> 2 Week 4 Placebo 10 0 0
#> 3 Week 8 Drug A 10 1 10
#> 4 Week 8 Placebo 10 0 0
#> 5 Week 12 Drug A 10 0 0
#> 6 Week 12 Placebo 10 2 20
#> 7 Week 16 Drug A 10 1 10
#> 8 Week 16 Placebo 10 2 20Subject Patterns
print(miss$patterns)
#> # A tibble: 20 × 4
#> USUBJID TRT pattern dropout_visit
#> <fct> <chr> <chr> <chr>
#> 1 SUBJ-1 Placebo complete NA
#> 2 SUBJ-2 Placebo complete NA
#> 3 SUBJ-3 Placebo monotone Week 12
#> 4 SUBJ-4 Placebo complete NA
#> 5 SUBJ-5 Placebo complete NA
#> 6 SUBJ-6 Placebo complete NA
#> 7 SUBJ-7 Placebo complete NA
#> 8 SUBJ-8 Placebo monotone Week 12
#> 9 SUBJ-9 Placebo complete NA
#> 10 SUBJ-10 Placebo complete NA
#> 11 SUBJ-11 Drug A complete NA
#> 12 SUBJ-12 Drug A complete NA
#> 13 SUBJ-13 Drug A complete NA
#> 14 SUBJ-14 Drug A complete NA
#> 15 SUBJ-15 Drug A intermittent NA
#> 16 SUBJ-16 Drug A complete NA
#> 17 SUBJ-17 Drug A complete NA
#> 18 SUBJ-18 Drug A monotone Week 16
#> 19 SUBJ-19 Drug A complete NA
#> 20 SUBJ-20 Drug A complete NASummary by Treatment Group
print(miss$summary)
#> # A tibble: 2 × 5
#> group n_subjects n_complete n_monotone n_intermittent
#> <chr> <int> <int> <int> <int>
#> 1 Drug A 10 8 1 1
#> 2 Placebo 10 8 2 0The three pattern types are:
- complete: No missing outcome values
- monotone: Once a value is missing, all subsequent visits are also missing (dropout)
- intermittent: Missing values with observed values both before and after
Preparing ICE Data
When subjects discontinue treatment, you may want to apply
reference-based imputation strategies. The
prepare_data_ice() function builds the required
data_ice data.frame from a discontinuation flag:
data_ice <- prepare_data_ice(
data = dat,
vars = vars,
ice_col = "DISCFL",
strategy = "JR" # Jump to Reference
)
print(data_ice)
#> USUBJID AVISIT STRATEGY
#> 1 SUBJ-18 Week 16 JR
#> 2 SUBJ-3 Week 12 JR
#> 3 SUBJ-8 Week 12 JRThe function:
- Identifies subjects with ICE flags (
"Y",TRUE, or1) - Takes the first flagged visit per subject
- Assigns the specified imputation strategy
Available strategies are:
-
"MAR": Missing at Random -
"CR": Copy Reference -
"JR": Jump to Reference -
"CIR": Copy Increment from Reference -
"LMCF": Last Mean Carried Forward
Complete Workflow
Here’s how these functions fit into a typical rbmi workflow:
library(rbmi)
library(rbmiUtils)
# 1. Validate data
validate_data(dat, vars)
# 2. Understand missing patterns
miss <- summarise_missingness(dat, vars)
print(miss$summary)
# 3. Prepare ICE data if needed
data_ice <- prepare_data_ice(dat, vars, ice_col = "DISCFL", strategy = "JR")
# 4. Define method
method <- method_bayes(
n_samples = 100,
control = control_bayes(warmup = 200, thin = 2)
)
# 5. Run imputation
draws_obj <- draws(
data = dat,
vars = vars,
data_ice = data_ice,
method = method
)
# 6. Continue with impute() and analyse()Summary
The data preparation functions in rbmiUtils help you:
-
Catch issues early with
validate_data()before running time-consuming imputations -
Understand your data with
summarise_missingness()to characterize missing data patterns -
Simplify ICE handling with
prepare_data_ice()to builddata_icefrom flag columns
These utilities complement the core rbmi package and support reproducible, well-documented analysis workflows.