Inframe Simulation Using Table Operations • simaerep

Load

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tibble))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(simaerep))

Introduction

With the 0.6.0 release we have added an alternative version of the {simaerep} algorithm that was coded using solely dbplyr compatible table operations.

expand the patients for each site r times
join each patient with a random eligible patient from same study
for each replicate calculate event per visit rate per site
calculate the ratio of having a lower event per visit rate than actually observed

This comes with the following advantages and disadvantages:

Patients are individually matched with patients that have reached the same visit in the study. No need to pick visit_med75 as an evaluation point.
dbplyr compatibility means that code execution can be done in a database back-end as opposed to in-memory.
Matching patients individually is more costly, this increases in-memory computation time
Limited patient sample pool for patients that have more visits than other patients in study.

Sample Data

set.seed(1)

df_visit <- sim_test_data_study(
  n_pat = 1000, # number of patients in study
  n_sites = 100, # number of sites in study
  frac_site_with_ur = 0.05, # fraction of sites under-reporting
  ur_rate = 0.4, # rate of under-reporting
  ae_per_visit_mean = 0.5 # mean AE per patient visit
)

df_visit$study_id <- "A"

Patient-Level Matching

Here we use the standard version of the algorithm.

aerep_trad <- simaerep(df_visit)
plot(aerep_trad)

To use the patient-level matching algorithm we set inframe=TRUE and visit_med75=FALSE.

The original algorithm uses fixed seeds before sampling while the inframe method does not. In order to obtain consistent results we need to manually set a seed.

set.seed(1)

aerep_inframe <- simaerep(
  df_visit,
  inframe = TRUE,
  visit_med75 = FALSE
)

The plot shows that for all sites 10/10 patients were used and none were excluded. We also observe that the site average has become more noisy as less patients are used to calculate the averages for the higher visit numbers.

plot(aerep_inframe)

The inframe method includes this noisier data but does not compare average event counts but event per visit rates. We can find events_per_visit_site and events_per_visit_study in df_eval. The latter is the average event rate obtained in the simulation in which each patient has been resampled according to its maximum visit.

aerep_inframe$df_eval

## # A tibble: 100 × 10
##    study_id site_number events events_per_visit_site visits n_pat
##    <chr>    <chr>        <dbl>                 <dbl>  <dbl> <int>
##  1 A        S0001           50                 0.279    179    10
##  2 A        S0002           59                 0.281    210    10
##  3 A        S0003           55                 0.291    189    10
##  4 A        S0004           59                 0.312    189    10
##  5 A        S0005           58                 0.297    195    10
##  6 A        S0006          100                 0.493    203    10
##  7 A        S0007           89                 0.408    218    10
##  8 A        S0008           95                 0.559    170    10
##  9 A        S0009          101                 0.518    195    10
## 10 A        S0010           87                 0.470    185    10
## # ℹ 90 more rows
## # ℹ 4 more variables: events_per_visit_study <dbl>, prob_low <dbl>,
## #   prob_low_adj <dbl>, prob_low_prob_ur <dbl>

We can also force the inframe method to use the visit_med75 this will pre-filter df_visit, which adds an extra step and decreases performance.

set.seed(1)

aerep_inframe_visit_med75 <- simaerep(
  df_visit,
  inframe = TRUE,
  visit_med75 = TRUE
)

plot(aerep_inframe_visit_med75)

Calculate Reporting Probabilities for Multiple Event Types

We can also calculate the reporting probabilities for multiple event types. This is done by passing a list of event types to the event_names parameter.

df_visit <- sim_test_data_study(
  event_names = c("ae", "pd"),
  ae_per_visit_mean = c(0.5, 0.4)
)

df_visit$study_id <- "A"

df_visit %>%
  head(5) %>%
  knitr::kable()

patnum	site_number	is_ur	max_visit_mean	max_visit_sd	ae_per_visit_mean	pd_per_visit_mean	visit	n_ae	n_pd	study_id
P000001	S0001	FALSE	20	4	0.5	0.4	1	1	1	A
P000001	S0001	FALSE	20	4	0.5	0.4	2	2	1	A
P000001	S0001	FALSE	20	4	0.5	0.4	3	2	1	A
P000001	S0001	FALSE	20	4	0.5	0.4	4	2	1	A
P000001	S0001	FALSE	20	4	0.5	0.4	5	2	1	A

eventrep <- simaerep(df_visit, event_names = c("ae", "pd"), inframe = TRUE, visit_med75 = FALSE)

eventrep$df_eval %>%
  select(site_number, starts_with("ae_")) %>%
  head(5) %>%
  knitr::kable()

site_number	ae_events	ae_per_visit_site	ae_per_visit_study	ae_prob_low	ae_prob_low_adj	ae_prob_low_prob_ur
S0001	456	0.4835631	0.4999226	0.249	0.7323077	0.2676923
S0002	511	0.5350785	0.5112042	0.842	0.9273684	0.0726316
S0003	458	0.4751037	0.4991961	0.147	0.7323077	0.2676923
S0004	480	0.4994797	0.5010447	0.476	0.7323077	0.2676923
S0005	489	0.4817734	0.4995084	0.233	0.7323077	0.2676923

eventrep$df_eval %>%
  select(site_number, starts_with("pd_")) %>%
  head(5) %>%
  knitr::kable()

site_number	pd_events	pd_per_visit_site	pd_per_visit_study	pd_prob_low	pd_prob_low_adj	pd_prob_low_prob_ur
S0001	334	0.3541888	0.3856405	0.050	0.4600000	0.5400000
S0002	346	0.3623037	0.3824126	0.155	0.6200000	0.3800000
S0003	386	0.4004149	0.3879990	0.746	0.9666667	0.0333333
S0004	384	0.3995838	0.3885390	0.727	0.9666667	0.0333333
S0005	382	0.3763547	0.3882680	0.279	0.6975000	0.3025000

DB

We can demonstrate the database-backend compatibility by using a connection to a in memory duckdb database. In order to set the number of replications we need to create a new table in our back-end that has one column with as many rows as the desired replications.

A lazy reference to this table can then be passed to the r parameter.

con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")
df_r <- tibble(rep = seq(1, 1000))

dplyr::copy_to(con, df_visit, "visit")
dplyr::copy_to(con, df_r, "r")

tbl_visit <- tbl(con, "visit")
tbl_r <- tbl(con, "r")


aerep <- simaerep(tbl_visit, r = tbl_r, visit_med75 = FALSE)

When inspecting df_eval we see that it is still a lazy table object.

aerep$df_eval

## # Source:     SQL [?? x 10]
## # Database:   DuckDB v1.2.1 [koneswab@Darwin 23.6.0:R 4.4.1/:memory:]
## # Ordered by: study_id, site_number
##    study_id site_number events events_per_visit_site visits n_pat
##    <chr>    <chr>        <dbl>                 <dbl>  <dbl> <dbl>
##  1 A        S0007          492                 0.489   1006    50
##  2 A        S0001          456                 0.484    943    50
##  3 A        S0003          458                 0.475    964    50
##  4 A        S0016          467                 0.485    962    50
##  5 A        S0010          480                 0.498    964    50
##  6 A        S0004          480                 0.499    961    50
##  7 A        S0020          503                 0.527    955    50
##  8 A        S0012          457                 0.477    958    50
##  9 A        S0005          489                 0.482   1015    50
## 10 A        S0017          497                 0.504    987    50
## # ℹ more rows
## # ℹ 4 more variables: events_per_visit_study <dbl>, prob_low <dbl>,
## #   prob_low_adj <dbl>, prob_low_prob_ur <dbl>

We can convert it to sql code. The cte option makes the sql code more readable.

sql_eval <- dbplyr::sql_render(aerep$df_eval, sql_options = dbplyr::sql_options(cte = TRUE))
stringr::str_trunc(sql_eval, 500)

## <SQL> WITH q01 AS (
##   SELECT
##     patnum,
##     site_number,
##     is_ur,
##     max_visit_mean,
##     max_visit_sd,
##     ae_per_visit_mean,
##     pd_per_visit_mean,
##     CAST(visit AS NUMERIC) AS visit,
##     CAST(n_ae AS NUMERIC) AS n_ae,
##     n_pd,
##     study_id
##   FROM visit
## ),
## q02 AS (
##   SELECT DISTINCT study_id, site_number
##   FROM q01
##   GROUP BY study_id, site_number, patnum
## ),
## q03 AS (
##   SELECT ROW_NUMBER() OVER () AS rep
##   FROM r
## ),
## q04 AS (
##   SELECT study_id, site_number, patnum, MAX(visit) AS max_visit_per_...

We can take that code and wrap it in a CREATE TABLE statement

sql_create <- glue::glue("CREATE TABLE eval AS ({sql_eval})")
DBI::dbExecute(con, sql_create)

## [1] 20

Retrieve the new table from the database.

tbl_eval <- tbl(con, "eval")
tbl_eval

## # Source:   table<eval> [?? x 10]
## # Database: DuckDB v1.2.1 [koneswab@Darwin 23.6.0:R 4.4.1/:memory:]
##    study_id site_number events events_per_visit_site visits n_pat
##    <chr>    <chr>        <dbl>                 <dbl>  <dbl> <dbl>
##  1 A        S0006          516                 0.493   1047    50
##  2 A        S0017          497                 0.504    987    50
##  3 A        S0016          467                 0.485    962    50
##  4 A        S0009          508                 0.542    937    50
##  5 A        S0010          480                 0.498    964    50
##  6 A        S0003          458                 0.475    964    50
##  7 A        S0014          493                 0.491   1005    50
##  8 A        S0015          514                 0.511   1006    50
##  9 A        S0011          443                 0.458    968    50
## 10 A        S0007          492                 0.489   1006    50
## # ℹ more rows
## # ℹ 4 more variables: events_per_visit_study <dbl>, prob_low <dbl>,
## #   prob_low_adj <dbl>, prob_low_prob_ur <dbl>

We plot the results from the {simaerep} object.

plot(aerep)

Or more efficiently by using plot_study() when we have already written the simaerep results into the database. Here we avoid that the results are being recalculated just for the sake of creating a plot. However this requires that we save df_site to the database as well.

sql_site <- dbplyr::sql_render(aerep$df_site)
DBI::dbExecute(con, glue::glue("CREATE TABLE site AS ({sql_site})"))

## [1] 20

tbl_site <- tbl(con, "site")

plot_study(tbl_visit, tbl_site, tbl_eval, study = "A")

DBI::dbDisconnect(con)

In Memory Calculation Times

Here we perform some examplary tests to illustrate the increase in-memory calculation time of the inframe calculation method.

This is the calculation time for the default settings

system.time({simaerep(df_visit, inframe = FALSE, visit_med75 = TRUE, under_only = TRUE, progress = FALSE)})

##    user  system elapsed 
##   0.310   0.004   0.315

inframe calculation time is higher

system.time({simaerep(df_visit, inframe = TRUE, visit_med75 = FALSE, under_only = TRUE)})

##    user  system elapsed 
##   0.701   0.039   0.739