Skip to contents

Introduction

Simulate subject-level event reporting of clinical trial sites with the goal of detecting over- and under-reporting.

Monitoring reporting rates of subject-level clinical events (e.g. adverse events, protocol deviations) reported by clinical trial sites is an important aspect of risk-based quality monitoring (RBQM) strategy. Sites that are under-reporting or over-reporting events can be detected using bootstrap simulations during which patients are redistributed between sites. Site-specific distributions of event reporting rates are generated that are used to assign probabilities to the observed reporting rates.

The method is inspired by the ‘infer’ R package and Allen Downey’s blog article: “There is only one test!”.

Why use {simaerep}?

Patient-level events during a clinical study do not occur randomly at visits but they are heavily dependent on the underlying study protocol and the planned procedures. We have illustrated this in our simulated test data by sampling event rates from a gamma distribution with higher rates for early visits and low rates for late visits.

Typically most RBQM tools calculate a ratio between the site event count and the number of patient, the days that they spend on study or the number of visits. Then they employ some outlier detection method to detect outliers (e.g. box plot statistics, percentiles, funnel plots). All of these methods assume a constant event rate over the course of the study or that all patients started the trial at the same time.

In a simulation experiments we could show that {simaerep} outperforms these methods see documentation.

{simaerep} has also been positively evaluated by three different industry members for the purpose of detecting sites that are under-reporting adverse events Koneswarakantha et al. 2024.

Algorithm

  1. Record initial site-level events per visits rate
  2. Replace every patient with a random patient with the same number of visits
  3. Record site-level events per visits rate for new set of patients
  4. Repeat 3) 1000 times to obtain site-specific events per visits rate distribution.
  5. Determine probability to obtain initial site-level events per visits rate from 1) using the distribution generated in 4)
  6. Adjust probabilities using the Benjamin Hochberg Procedure in order to correct alpha error.

Sample Data

Patient

Patient level AE data is characterized by the number of consecutive visits and the number of AEs that have been reported each time. For the maximum consecutive visit we sample from a normal distribution and for the AEs reported at each visit we sample from a poisson distribution.

Here we simulate the AEs generated by 3 patients

set.seed(1)

replicate(
   3,
   sim_test_data_patient(
    .f_sample_max_visit = function() rnorm(1, mean = 20, sd = 4),
    .f_sample_event_per_visit = function(max_visit) rpois(max_visit, 0.5)
  )
)
## [[1]]
##  [1]  0  1  1  2  4  5  6  6  6  6  7  7  8  8  9 12 12
## 
## [[2]]
##  [1] 0 1 1 1 1 1 1 2 2 2 2 2 2 3 4 5 5 6 6 7 8 9 9
## 
## [[3]]
##  [1] 0 0 1 2 2 3 3 3 3 3 3 3 4 4 6 6 6 6 7 7

Study

In order to simulate patient data for an entire study we assume make the simplification that all sites have the same number of patients. Further we specify a fraction of sites that is under-reporting events.

We sample event_rates from a gamma distribution to reflect the non-constant event rates.

set.seed(1)

df_visit <- sim_test_data_study(
  n_pat = 100, # number of patients in study
  n_sites = 10, # number of sites in study
  ratio_out = 0.1, # ratio of sites with outlier
  factor_event_rate = -0.5, # rate of under-reporting
  # non-constant event rates based on gamma distribution
  event_rates = (dgamma(seq(1, 20, 0.5), shape = 5, rate = 2) * 5) + 0.1,
  max_visit = 20,
  max_visit_sd = 10,
  study_id = "A"
)


df_visit %>%
  head(10) %>%
  knitr::kable()
patient_id site_id is_out max_visit_mean max_visit_sd event_per_visit_mean visit n_event study_id
P000001 S0001 TRUE 20 10 0.1762578 1 0 A
P000001 S0001 TRUE 20 10 0.1762578 2 2 A
P000001 S0001 TRUE 20 10 0.1762578 3 2 A
P000001 S0001 TRUE 20 10 0.1762578 4 4 A
P000001 S0001 TRUE 20 10 0.1762578 5 6 A
P000001 S0001 TRUE 20 10 0.1762578 6 7 A
P000001 S0001 TRUE 20 10 0.1762578 7 7 A
P000001 S0001 TRUE 20 10 0.1762578 8 7 A
P000001 S0001 TRUE 20 10 0.1762578 9 7 A
P000001 S0001 TRUE 20 10 0.1762578 10 7 A
df_visit %>%
  select(site_id, is_out) %>%
  distinct() %>%
  knitr::kable()
site_id is_out
S0001 TRUE
S0002 FALSE
S0003 FALSE
S0004 FALSE
S0005 FALSE
S0006 FALSE
S0007 FALSE
S0008 FALSE
S0009 FALSE
S0010 FALSE

In our sample data 1 sites (S0001) is under-reporting events.

Algorithm Execution

Standard

The column event_prob contains the reporting probability for each site as values between -1 and 1. Under- and over-reporting probabilities are inversely related and have been combined into one column. Negative values show high under-reporting, positive values higher over-reporting.

The column event_delta shows the difference of the expected event count minus the reported event count.

evrep <- simaerep(df_visit)
evrep
## simaerep object:
## ----------------
## Plot results using plot() generic.
## Full results available in "df_eval".
## 
## Summary:
## Number of sites: 10
## Number of studies: 1
## 
## Multiplicity correction applied to '*_prob' columns.
## 
## First 10 rows of df_eval:
## # A tibble: 10 × 10
##    study_id site_id event_count event_per_visit_site visits n_pat
##    <chr>    <chr>         <dbl>                <dbl>  <dbl> <int>
##  1 A        S0001            67                0.390    172    10
##  2 A        S0002            97                0.740    131    10
##  3 A        S0003           122                0.587    208    10
##  4 A        S0004           108                0.6      180    10
##  5 A        S0005           121                0.5      242    10
##  6 A        S0006           101                0.470    215    10
##  7 A        S0007           103                0.545    189    10
##  8 A        S0008           105                0.559    188    10
##  9 A        S0009           119                0.601    198    10
## 10 A        S0010            99                0.471    210    10
## # ℹ 4 more variables: event_per_visit_study <dbl>, event_prob_no_mult <dbl>,
## #   event_prob <dbl>, event_delta <dbl>
plot(evrep, study = "A")
## Warning in fortify(data, ...): Arguments in `...` must be used.
##  Problematic argument:
##  na.rm = TRUE
##  Did you misspell an argument name?
## Arguments in `...` must be used.
##  Problematic argument:
##  na.rm = TRUE
##  Did you misspell an argument name?

Left panel shows mean cumulative event reporting per site (blue lines) against mean cumulative event reporting of the entire study (golden line). Sites with either high under-reporting (negative probabilities) or high over-reporting (positive probabilities) are marked by grey dots and plotted in additional panels on the right. N denotes the number of sites. Right panel shows individual sites with total patient cumulative counts as grey lines. N denotes the number of patients, the percentage the under- and over-reporting probability and 394 denotes the difference compared to the expected number of events.

Illustrate Probability Scoring

We can illustrate the probability scoring by increasing the number of sites and switching off the multiplicity correction and not introducing any outlier.

Note that none of these sites are actually outlier but high scores are the results of repetitive testing.

df_visit <- sim_test_data_study(
  n_pat = 1000, # number of patients in study
  n_sites = 100, # number of sites in study
  # non-constant event rates based on gamma distribution
  event_rates = (dgamma(seq(1, 20, 0.5), shape = 5, rate = 2) * 5) + 0.1,
  max_visit = 20,
  max_visit_sd = 10,
  study_id = "A"
)

evrep <- simaerep(df_visit, mult_corr = FALSE)
evrep
## simaerep object:
## ----------------
## Plot results using plot() generic.
## Full results available in "df_eval".
## 
## Summary:
## Number of sites: 100
## Number of studies: 1
## 
## First 10 rows of df_eval:
## # A tibble: 10 × 9
##    study_id site_id event_count event_per_visit_site visits n_pat
##    <chr>    <chr>         <dbl>                <dbl>  <dbl> <int>
##  1 A        S0001            95                0.426    223    10
##  2 A        S0002           116                0.547    212    10
##  3 A        S0003            97                0.424    229    10
##  4 A        S0004           110                0.509    216    10
##  5 A        S0005           109                0.623    175    10
##  6 A        S0006           115                0.5      230    10
##  7 A        S0007           103                0.518    199    10
##  8 A        S0008           133                0.522    255    10
##  9 A        S0009           125                0.710    176    10
## 10 A        S0010           107                0.457    234    10
## # ℹ 3 more variables: event_per_visit_study <dbl>, event_prob <dbl>,
## #   event_delta <dbl>
plot(evrep, study = "A")
## Warning in fortify(data, ...): Arguments in `...` must be used.
##  Problematic argument:
##  na.rm = TRUE
##  Did you misspell an argument name?
## Arguments in `...` must be used.
##  Problematic argument:
##  na.rm = TRUE
##  Did you misspell an argument name?

Multiple Event Types

We can calculate scores for multiple event types in one go. This time we add a site outlier with 50% over-reporting.

set.seed(1)

event_names <- c("x", "y")

df_visit <- sim_test_data_study(
  n_pat = 1000, # number of patients in study
  n_sites = 100, # number of sites in study
  # non-constant event rates based on gamma distribution
  ratio_out = 1/100, # ratio of sites with outlier
  factor_event_rate = +0.5, # rate of over-reporting
  event_names = event_names,
  event_rates = list(
    (dgamma(seq(1, 20, 0.5), shape = 5, rate = 2) * 5) + 0.1,
    (dgamma(seq(1, 20, 0.5), shape = 4, rate = 2) * 5) + 0.05
  ),
  max_visit = 20,
  max_visit_sd = 10,
  study_id = "A"
)

df_visit %>%
  select(site_id, patient_id, visit, n_x, n_y)
## # A tibble: 19,535 × 5
##    site_id patient_id visit   n_x   n_y
##    <chr>   <chr>      <int> <dbl> <dbl>
##  1 S0001   P000001        1     2     3
##  2 S0001   P000001        2     7     7
##  3 S0001   P000001        3     9    15
##  4 S0001   P000001        4    14    17
##  5 S0001   P000001        5    19    19
##  6 S0001   P000001        6    21    21
##  7 S0001   P000001        7    22    21
##  8 S0001   P000001        8    22    21
##  9 S0001   P000001        9    22    21
## 10 S0001   P000001       10    22    21
## # ℹ 19,525 more rows
evrep <- simaerep(df_visit, event_names = event_names)
evrep
## simaerep object:
## ----------------
## Plot results using plot() generic.
## Full results available in "df_eval".
## 
## Summary:
## Number of sites: 100
## Number of studies: 1
## 
## Multiplicity correction applied to '*_prob' columns.
## 
## Reporting probabilities calculated for: x, y 
## 
## First 10 rows of df_eval:
## # A tibble: 10 × 16
##    study_id site_id x_count y_count x_per_visit_site y_per_visit_site visits
##    <chr>    <chr>     <dbl>   <dbl>            <dbl>            <dbl>  <dbl>
##  1 A        S0001       177     175            0.738            0.729    240
##  2 A        S0002       120      99            0.632            0.521    190
##  3 A        S0003       111      89            0.624            0.5      178
##  4 A        S0004       107     105            0.540            0.530    198
##  5 A        S0005        97      83            0.551            0.472    176
##  6 A        S0006       111      97            0.531            0.464    209
##  7 A        S0007       110     106            0.466            0.449    236
##  8 A        S0008       119     116            0.482            0.470    247
##  9 A        S0009       121     103            0.680            0.579    178
## 10 A        S0010        92      82            0.561            0.5      164
## # ℹ 9 more variables: n_pat <int>, x_per_visit_study <dbl>,
## #   y_per_visit_study <dbl>, x_prob_no_mult <dbl>, x_prob <dbl>, x_delta <dbl>,
## #   y_prob_no_mult <dbl>, y_prob <dbl>, y_delta <dbl>
plot(evrep, study = "A", plot_event = "x")
## Warning in fortify(data, ...): Arguments in `...` must be used.
##  Problematic argument:
##  na.rm = TRUE
##  Did you misspell an argument name?
## Arguments in `...` must be used.
##  Problematic argument:
##  na.rm = TRUE
##  Did you misspell an argument name?

plot(evrep, study = "A", plot_event = "y")
## Warning in fortify(data, ...): Arguments in `...` must be used.
##  Problematic argument:
##  na.rm = TRUE
##  Did you misspell an argument name?
## Arguments in `...` must be used.
##  Problematic argument:
##  na.rm = TRUE
##  Did you misspell an argument name?

Input Data Requirements

As input data the cumulative event count for each patient visit is required.

df_visit <- sim_test_data_study()

df_visit %>%
  filter(patient_id == "P000001") %>%
  select(site_id, patient_id, visit, n_event) %>%
  knitr::kable()
site_id patient_id visit n_event
S0001 P000001 1 3
S0001 P000001 2 6
S0001 P000001 3 10
S0001 P000001 4 11
S0001 P000001 5 13
S0001 P000001 6 16
S0001 P000001 7 17
S0001 P000001 8 17
S0001 P000001 9 17
S0001 P000001 10 17
S0001 P000001 11 17
S0001 P000001 12 17
S0001 P000001 13 17
S0001 P000001 14 17
S0001 P000001 15 17

In practice this requires data from different domains within the clinical data set to be merged. This requires the patient and site ids as well as the date and time at which visits and events are occurring.

There will be edge cases e.g. missing dates, missing patient link, events and visits on the same date, that require attention.

gsm.simaerep::InputCumCount that is part of the gsm extension gsm.simaerep uses subject, visit and event data as input and merges the data into a simaerep ready data frame.

However, it follows the gsm naming conventions so the column names will be different than those required by simaerep. We can adapt to the different column names in the following way.

gsm.simaerep has not yet been released on CRAN so we are not demonstrating it’s functionality here.

df_visit_gsm <- df_visit %>%
  mutate(
    study_id = "A"
  ) %>%
  select(
    StudyID = study_id,
    GroupID = site_id,
    SubjectID = patient_id, 
    Denominator = visit, 
    Numerator = n_event
  )

simaerep(
  df_visit_gsm,
  col_names = list(
    study_id = "StudyID",
    site_id = "GroupID",
    patient_id = "SubjectID",
    visit = "Denominator",
    n_ae = "Numerator"
  )
)
## simaerep object:
## ----------------
## Plot results using plot() generic.
## Full results available in "df_eval".
## 
## Summary:
## Number of sites: 20
## Number of studies: 1
## 
## Multiplicity correction applied to '*_prob' columns.
## 
## First 10 rows of df_eval:
## # A tibble: 10 × 10
##    StudyID GroupID ae_count ae_per_visit_site visits n_pat ae_per_visit_study
##    <chr>   <chr>      <dbl>             <dbl>  <dbl> <int>              <dbl>
##  1 A       S0001        615             0.616    998    50              0.601
##  2 A       S0002        607             0.611    994    50              0.603
##  3 A       S0003        615             0.617    996    50              0.604
##  4 A       S0004        613             0.637    962    50              0.618
##  5 A       S0005        589             0.623    945    50              0.630
##  6 A       S0006        576             0.579    995    50              0.604
##  7 A       S0007        618             0.610   1013    50              0.599
##  8 A       S0008        623             0.614   1014    50              0.594
##  9 A       S0009        583             0.610    956    50              0.624
## 10 A       S0010        574             0.585    982    50              0.612
## # ℹ 3 more variables: ae_prob_no_mult <dbl>, ae_prob <dbl>, ae_delta <dbl>

Multiple Studies

simaerep probabilities can be calculated for multiple studies at once.

df_visit <- bind_rows(
  sim_test_data_study(study_id = "A"),
  sim_test_data_study(study_id = "B")
)

simaerep(df_visit)
## simaerep object:
## ----------------
## Plot results using plot() generic.
## Full results available in "df_eval".
## 
## Summary:
## Number of sites: 40
## Number of studies: 2
## 
## Multiplicity correction applied to '*_prob' columns.
## 
## First 10 rows of df_eval:
## # A tibble: 10 × 10
##    study_id site_id event_count event_per_visit_site visits n_pat
##    <chr>    <chr>         <dbl>                <dbl>  <dbl> <int>
##  1 A        S0001           576                0.579    995    50
##  2 A        S0002           608                0.637    955    50
##  3 A        S0003           572                0.570   1003    50
##  4 A        S0004           619                0.607   1019    50
##  5 A        S0005           598                0.631    947    50
##  6 A        S0006           604                0.629    960    50
##  7 A        S0007           584                0.608    961    50
##  8 A        S0008           595                0.619    961    50
##  9 A        S0009           614                0.638    962    50
## 10 A        S0010           623                0.644    967    50
## # ℹ 4 more variables: event_per_visit_study <dbl>, event_prob_no_mult <dbl>,
## #   event_prob <dbl>, event_delta <dbl>

In Database Calculation

We can demonstrate the database-backend compatibility by using a connection to a in memory duckdb database. In order to set the number of replications we need to create a new table in our back-end that has one column with as many rows as the desired replications.

A lazy reference to this table can then be passed to the r parameter.

con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")
df_r <- tibble(rep = seq(1, 1000))

df_visit <- sim_test_data_study(
  n_pat = 100, # number of patients in study
  n_sites = 10, # number of sites in study
  ratio_out = 0.1, # ratio of sites with outlier
  factor_event_rate = -0.5, # rate of under-reporting
  # non-constant event rates based on gamma distribution
  event_rates = (dgamma(seq(1, 20, 0.5), shape = 5, rate = 2) * 5) + 0.1,
  max_visit = 20,
  max_visit_sd = 10,
  study_id = "A"
)

dplyr::copy_to(con, df_visit, "visit")
dplyr::copy_to(con, df_r, "r")

tbl_visit <- tbl(con, "visit")
tbl_r <- tbl(con, "r")

evrep_db <- simaerep(tbl_visit, r = tbl_r)

When inspecting df_eval we see that it is still a lazy table object.

evrep_db
## simaerep object:
## ----------------
## Plot results using plot() generic.
## Full results available in "df_eval".
## 
## Summary:
## Multiplicity correction applied to '*_prob' columns.
## 
## First 10 rows of df_eval:
## # Source:     SQL [?? x 10]
## # Database:   DuckDB v1.3.0 [koneswab@Darwin 23.6.0:R 4.4.1/:memory:]
## # Ordered by: study_id, site_id
##    study_id site_id event_count event_per_visit_site visits n_pat
##    <chr>    <chr>         <dbl>                <dbl>  <dbl> <dbl>
##  1 A        S0010           107                0.480    223    10
##  2 A        S0001            61                0.302    202    10
##  3 A        S0005            96                0.381    252    10
##  4 A        S0003           125                0.595    210    10
##  5 A        S0008           103                0.462    223    10
##  6 A        S0007           122                0.552    221    10
##  7 A        S0004           116                0.483    240    10
##  8 A        S0009           123                0.530    232    10
##  9 A        S0006           119                0.463    257    10
## 10 A        S0002           105                0.544    193    10
## # ℹ 4 more variables: event_per_visit_study <dbl>, event_prob_no_mult <dbl>,
## #   event_prob <dbl>, event_delta <dbl>

We can convert it to sql code. The cte option makes the sql code more readable.

sql_eval <- dbplyr::sql_render(evrep_db$df_eval, sql_options = dbplyr::sql_options(cte = TRUE))
stringr::str_trunc(sql_eval, 500)
## <SQL> WITH q01 AS (
##   SELECT
##     patient_id AS patnum,
##     site_id AS site_number,
##     is_out,
##     max_visit_mean,
##     max_visit_sd,
##     event_per_visit_mean,
##     visit,
##     n_event,
##     study_id
##   FROM visit
## ),
## q02 AS (
##   SELECT
##     patnum,
##     site_number,
##     is_out,
##     max_visit_mean,
##     max_visit_sd,
##     event_per_visit_mean,
##     CAST(visit AS NUMERIC) AS visit,
##     CAST(n_event AS NUMERIC) AS n_event,
##     study_id
##   FROM q01
## ),
## q03 AS (
##   SELECT DISTINCT study_id, site_number
##   FROM q02 q01...

We can take that code and wrap it in a CREATE TABLE statement

sql_create <- glue::glue("CREATE TABLE eval AS ({sql_eval})")
DBI::dbExecute(con, sql_create)
## [1] 10

Retrieve the new table from the database.

tbl_eval <- tbl(con, "eval")
tbl_eval
## # Source:   table<eval> [?? x 10]
## # Database: DuckDB v1.3.0 [koneswab@Darwin 23.6.0:R 4.4.1/:memory:]
##    study_id site_id event_count event_per_visit_site visits n_pat
##    <chr>    <chr>         <dbl>                <dbl>  <dbl> <dbl>
##  1 A        S0008           103                0.462    223    10
##  2 A        S0007           122                0.552    221    10
##  3 A        S0010           107                0.480    223    10
##  4 A        S0005            96                0.381    252    10
##  5 A        S0006           119                0.463    257    10
##  6 A        S0003           125                0.595    210    10
##  7 A        S0002           105                0.544    193    10
##  8 A        S0009           123                0.530    232    10
##  9 A        S0001            61                0.302    202    10
## 10 A        S0004           116                0.483    240    10
## # ℹ 4 more variables: event_per_visit_study <dbl>, event_prob_no_mult <dbl>,
## #   event_prob <dbl>, event_delta <dbl>

We plot the results from the {simaerep} object.

plot(evrep_db, study = "A")
## Warning in fortify(data, ...): Arguments in `...` must be used.
##  Problematic argument:
##  na.rm = TRUE
##  Did you misspell an argument name?
## Arguments in `...` must be used.
##  Problematic argument:
##  na.rm = TRUE
##  Did you misspell an argument name?

DBI::dbDisconnect(con)