
Introduction

When performing multiple imputation with many imputations (e.g., 100-1000), the full imputed dataset can become very large. However, most of this data is redundant: observed values are identical across all imputations.

The rbmiUtils package provides two functions to address this:

  • reduce_imputed_data() stores only the rows whose values were imputed
  • expand_imputed_data() reconstructs the full imputed dataset from the reduced data plus the original data

This approach can reduce storage requirements by 90% or more, depending on the proportion of missing data.

The Storage Problem

Consider a typical clinical trial dataset:

  • 500 subjects
  • 5 visits per subject = 2,500 rows
  • 5% missing data = 125 missing values
  • 1,000 imputations

Full storage: 2,500 rows × 1,000 imputations = 2.5 million rows

Reduced storage: 125 missing values × 1,000 imputations = 125,000 rows (5% of full size)
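
This arithmetic is easy to sanity-check in base R, using the numbers above:

full_rows    <- 2500 * 1000  # all subject-visit rows x imputations
reduced_rows <- 125 * 1000   # originally missing values x imputations
c(full = full_rows, reduced = reduced_rows, fraction = reduced_rows / full_rows)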

Example with Package Data

The rbmiUtils package includes example datasets we can use:

data("ADMI", package = "rbmiUtils")  # Full imputed dataset
data("ADEFF", package = "rbmiUtils") # Original data with missing values

# Check dimensions
cat("Full imputed dataset (ADMI):", nrow(ADMI), "rows\n")
#> Full imputed dataset (ADMI): 100000 rows
cat("Number of imputations:", length(unique(ADMI$IMPID)), "\n")
#> Number of imputations: 100
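
You can also gauge the in-memory footprint directly; object.size() gives a quick base R estimate (exact values depend on your session):

format(object.size(ADMI), units = "MB")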

Reducing Imputed Data

First, prepare the original data to match the imputed data structure:

original <- ADEFF |>
  mutate(
    TRT = TRT01P,
    USUBJID = as.character(USUBJID)
  )

# Count missing values
n_missing <- sum(is.na(original$CHG))
cat("Missing values in original data:", n_missing, "\n")
#> Missing values in original data: 44
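
If you want to see where the missingness occurs, a base R breakdown by visit is enough (the pattern will vary by dataset):

# Missing CHG values per visit
tapply(is.na(original$CHG), original$AVISIT, sum)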

Define the variables specification:

vars <- set_vars(
  subjid = "USUBJID",
  visit = "AVISIT",
  group = "TRT",
  outcome = "CHG"
)

Now reduce the imputed data:

reduced <- reduce_imputed_data(ADMI, original, vars)

cat("Full imputed rows:", nrow(ADMI), "\n")
#> Full imputed rows: 100000
cat("Reduced rows:", nrow(reduced), "\n")
#> Reduced rows: 4400
cat("Compression ratio:", round(100 * nrow(reduced) / nrow(ADMI), 1), "%\n")
#> Compression ratio: 4.4 %

What’s in the Reduced Data?

The reduced dataset contains only the rows that were originally missing:

# First few rows
head(reduced)
#> # A tibble: 6 × 12
#>   IMPID STRATA REGION  REGIONC TRT    BASE   CHG AVISIT USUBJID CRIT1FLN CRIT1FL
#>   <dbl> <chr>  <chr>     <dbl> <chr> <dbl> <dbl> <chr>  <chr>      <dbl> <chr>  
#> 1     1 A      North …       1 Plac…    12 -1.96 Week … ID011          0 N      
#> 2     1 A      Europe        3 Drug…     3 -3.71 Week … ID014          0 N      
#> 3     1 B      Europe        3 Drug…     9 -1.96 Week … ID018          0 N      
#> 4     1 A      Asia          4 Drug…    10 -5.55 Week … ID033          0 N      
#> 5     1 A      Asia          4 Drug…     0 -1.28 Week … ID061          0 N      
#> 6     1 A      South …       2 Plac…     5 -2.60 Week … ID071          0 N      
#> # ℹ 1 more variable: CRIT <chr>

# Structure matches original imputed data
cat("\nColumns in reduced data:\n")
#> 
#> Columns in reduced data:
cat(paste(names(reduced), collapse = ", "))
#> IMPID, STRATA, REGION, REGIONC, TRT, BASE, CHG, AVISIT, USUBJID, CRIT1FLN, CRIT1FL, CRIT

Each row represents an imputed value for a specific subject-visit-imputation combination.
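
One way to confirm this structure is to count rows per imputation: each of the 100 imputations should contribute exactly n_missing rows (44 here), one per originally missing value:

rows_per_imp <- table(reduced$IMPID)
stopifnot(all(rows_per_imp == n_missing))  # 44 rows per imputation, 100 imputations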

Expanding Back to Full Data

When you need to run analyses, expand the reduced data back to full form:

expanded <- expand_imputed_data(reduced, original, vars)

cat("Expanded rows:", nrow(expanded), "\n")
#> Expanded rows: 100000
cat("Original ADMI rows:", nrow(ADMI), "\n")
#> Original ADMI rows: 100000

Verifying Data Integrity

Let’s verify that the round-trip preserves data integrity:

# Sort both datasets for comparison
admi_sorted <- ADMI |>
  arrange(IMPID, USUBJID, AVISIT)

expanded_sorted <- expanded |>
  arrange(IMPID, USUBJID, AVISIT)

# Compare CHG values
all_equal <- all.equal(
  admi_sorted$CHG,
  expanded_sorted$CHG,
  tolerance = 1e-10
)

cat("Data integrity check:", all_equal, "\n")
#> Data integrity check: TRUE
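
The check above compares only the outcome column. A stricter test compares the full data frames, aligning columns first in case their order differs (a sketch; as.data.frame() strips tibble attributes that would otherwise trip up the comparison):

cols <- sort(names(admi_sorted))
stopifnot(isTRUE(all.equal(
  as.data.frame(admi_sorted[cols]),
  as.data.frame(expanded_sorted[cols])
)))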

Practical Workflow

Here’s how to integrate efficient storage into your workflow:

Save Reduced Data

# After imputation
impute_obj <- impute(draws_obj, references = c("Placebo" = "Placebo", "Drug A" = "Placebo"))
full_imputed <- get_imputed_data(impute_obj)

# Reduce for storage
reduced <- reduce_imputed_data(full_imputed, original_data, vars)

# Save both (reduced is much smaller)
saveRDS(reduced, "imputed_reduced.rds")
saveRDS(original_data, "original_data.rds")
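
To quantify the on-disk benefit, you can serialize the full dataset once for comparison and check the file sizes (illustrative only; imputed_full.rds is not needed afterwards). Note that saveRDS() compresses by default, so the on-disk ratio will not exactly match the row-count ratio:

saveRDS(full_imputed, "imputed_full.rds")  # written only for this comparison
ratio <- file.size("imputed_reduced.rds") / file.size("imputed_full.rds")
cat("Reduced file is", round(100 * ratio, 1), "% the size of the full file\n")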

Load and Analyse

# Load saved data
reduced <- readRDS("imputed_reduced.rds")
original_data <- readRDS("original_data.rds")

# Expand when needed for analysis
full_imputed <- expand_imputed_data(reduced, original_data, vars)

# Run analysis
ana_obj <- analyse_mi_data(
  data = full_imputed,
  vars = vars,
  method = method,  # the rbmi method object used when drawing the imputations
  fun = ancova
)
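
From here the standard rbmi workflow applies: assuming ana_obj is compatible with rbmi’s pooling step, as in the usual pipeline, the results can be pooled and inspected:

pool_obj <- pool(ana_obj)
pool_obj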

Storage Comparison

Here’s a comparison of storage requirements for different scenarios:

Subjects  Visits  Missing %  Imputations  Full Rows  Reduced Rows  Savings
500       5       5%         100          250,000    12,500        95%
500       5       5%         1,000        2,500,000  125,000       95%
1,000     8       10%        500          4,000,000  400,000       90%
200       4       20%        1,000        800,000    160,000       80%

The savings depend on the proportion of missing data, not on the number of imputations:

  • Lower missing % = greater relative savings
  • More imputations = the same relative savings, but a larger absolute reduction (see the sketch below)
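
Because the reduced row count is full rows × missing %, the relative saving is simply 1 - missing %, independent of the number of imputations. A small script reproduces the table above:

scenarios <- data.frame(
  subjects = c(500, 500, 1000, 200),
  visits   = c(5, 5, 8, 4),
  miss     = c(0.05, 0.05, 0.10, 0.20),
  n_imp    = c(100, 1000, 500, 1000)
)
scenarios$full_rows    <- with(scenarios, subjects * visits * n_imp)
scenarios$reduced_rows <- with(scenarios, full_rows * miss)
scenarios$savings_pct  <- 100 * (1 - scenarios$miss)
scenarios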

When to Use This Approach

Use reduced storage when:

  • Running many imputations (100+)
  • Saving imputed data for later analysis
  • Sharing data between team members
  • Working with memory constraints

Keep full data when:

  • Working interactively with few imputations
  • Performing exploratory analysis
  • Storage is not a concern

Edge Cases

No Missing Data

If the original data has no missing values, reduce_imputed_data() returns an empty data.frame:

# If original has no missing values
reduced <- reduce_imputed_data(full_imputed, complete_data, vars)
nrow(reduced)
#> [1] 0

# expand_imputed_data handles this correctly
expanded <- expand_imputed_data(reduced, complete_data, vars)
# Returns original data with IMPID = "1"

Single Imputation

The functions work with any number of imputations, including just one.
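
For example, subsetting the reduced data to a single imputation and expanding it still yields one full copy of the dataset (a sketch reusing the objects from earlier):

one_imp      <- reduced[reduced$IMPID == 1, ]
expanded_one <- expand_imputed_data(one_imp, original, vars)
nrow(expanded_one)  # one full copy of the original data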

Summary

The reduce_imputed_data() and expand_imputed_data() functions provide an efficient way to store imputed datasets:

  1. Reduce after imputation to store only what’s necessary
  2. Expand before analysis to reconstruct the full dataset
  3. Verify that data integrity is preserved through the round trip

This approach is particularly valuable when working with large numbers of imputations or when storage and memory are constrained.