There might be scenarios in which one prefers a parametric test over the non-parametric bootstrap-based resampling method to calculate AE under-reporting. simaerep provides a function that bootstrap-resamples the entire study while preserving the actual site parameters, such as the number of patients and visit_med75. Using this pool we can check whether the p-values calculated by poisson.test() represent accurate probabilities.
We simulate three studies with varying AE-per-visit rates and then use all the functions we have already introduced.
df_visit1 <- sim_test_data_study(
  n_pat = 1000,
  n_sites = 100,
  frac_site_with_ur = 0.2,
  ur_rate = 0.5,
  max_visit_mean = 20,
  max_visit_sd = 4,
  ae_per_visit_mean = 0.5
)

df_visit1$study_id <- "ae_per_visit: 0.5"

df_visit2 <- sim_test_data_study(
  n_pat = 1000,
  n_sites = 100,
  frac_site_with_ur = 0.2,
  ur_rate = 0.5,
  max_visit_mean = 20,
  max_visit_sd = 4,
  ae_per_visit_mean = 0.2
)

df_visit2$study_id <- "ae_per_visit: 0.2"

df_visit3 <- sim_test_data_study(
  n_pat = 1000,
  n_sites = 100,
  frac_site_with_ur = 0.2,
  ur_rate = 0.5,
  max_visit_mean = 20,
  max_visit_sd = 4,
  ae_per_visit_mean = 0.05
)

df_visit3$study_id <- "ae_per_visit: 0.05"

df_visit <- bind_rows(df_visit1, df_visit2, df_visit3)
df_site <- site_aggr(df_visit)
df_sim_sites <- sim_sites(df_site, df_visit)
df_visit
## # A tibble: 59,221 × 9
## patnum site_number is_ur max_visit_mean max_vi…¹ ae_pe…² visit n_ae study…³
## <chr> <chr> <lgl> <dbl> <dbl> <dbl> <int> <int> <chr>
## 1 P000001 S0001 TRUE 20 4 0.25 1 0 ae_per…
## 2 P000001 S0001 TRUE 20 4 0.25 2 1 ae_per…
## 3 P000001 S0001 TRUE 20 4 0.25 3 2 ae_per…
## 4 P000001 S0001 TRUE 20 4 0.25 4 2 ae_per…
## 5 P000001 S0001 TRUE 20 4 0.25 5 2 ae_per…
## 6 P000001 S0001 TRUE 20 4 0.25 6 2 ae_per…
## 7 P000001 S0001 TRUE 20 4 0.25 7 2 ae_per…
## 8 P000001 S0001 TRUE 20 4 0.25 8 4 ae_per…
## 9 P000001 S0001 TRUE 20 4 0.25 9 4 ae_per…
## 10 P000001 S0001 TRUE 20 4 0.25 10 4 ae_per…
## # … with 59,211 more rows, and abbreviated variable names ¹max_visit_sd,
## # ²ae_per_visit_mean, ³study_id
df_site
## # A tibble: 300 × 6
## study_id site_number n_pat n_pat_with_med75 visit_med75 mean_ae_s…¹
## <chr> <chr> <int> <int> <dbl> <dbl>
## 1 ae_per_visit: 0.05 S0001 10 8 16 0
## 2 ae_per_visit: 0.05 S0002 10 9 16 0.333
## 3 ae_per_visit: 0.05 S0003 10 10 15 0.3
## 4 ae_per_visit: 0.05 S0004 10 9 16 0.111
## 5 ae_per_visit: 0.05 S0005 10 9 18 0.556
## 6 ae_per_visit: 0.05 S0006 10 8 17 0
## 7 ae_per_visit: 0.05 S0007 10 10 16 0.6
## 8 ae_per_visit: 0.05 S0008 10 10 15 0.5
## 9 ae_per_visit: 0.05 S0009 10 9 17 0.111
## 10 ae_per_visit: 0.05 S0010 10 10 16 0.5
## # … with 290 more rows, and abbreviated variable name ¹mean_ae_site_med75
df_sim_sites
## # A tibble: 300 × 10
## study…¹ site_…² n_pat n_pat…³ visit…⁴ mean_…⁵ mean_…⁶ n_pat…⁷ pval prob_…⁸
## <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 ae_per… S0001 10 8 16 0 0.699 860 0.00834 0.004
## 2 ae_per… S0002 10 9 16 0.333 0.696 859 0.307 0.132
## 3 ae_per… S0003 10 10 15 0.3 0.643 898 0.230 0.106
## 4 ae_per… S0004 10 9 16 0.111 0.698 859 0.0252 0.016
## 5 ae_per… S0005 10 9 18 0.556 0.791 704 0.570 0.28
## 6 ae_per… S0006 10 8 17 0 0.742 787 0.00559 0.007
## 7 ae_per… S0007 10 10 16 0.6 0.693 858 1 0.44
## 8 ae_per… S0008 10 10 15 0.5 0.640 898 0.840 0.358
## 9 ae_per… S0009 10 9 17 0.111 0.742 786 0.0178 0.013
## 10 ae_per… S0010 10 10 16 0.5 0.695 858 0.699 0.304
## # … with 290 more rows, and abbreviated variable names ¹study_id, ²site_number,
## # ³n_pat_with_med75, ⁴visit_med75, ⁵mean_ae_site_med75, ⁶mean_ae_study_med75,
## # ⁷n_pat_with_med75_study, ⁸prob_low
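The pval column above comes from poisson.test(). Conceptually, the total AE count of the patients that reached visit_med75 at a site is tested against the count expected from the study-level mean. Below is a minimal one-sample sketch using the numbers of site S0001 shown above; it is illustrative only, and the exact comparison inside sim_sites() may differ.

# Illustrative one-sample Poisson test for site S0001 of the
# "ae_per_visit: 0.05" study (sim_sites() may implement the comparison
# differently):
site_n_pat <- 8         # n_pat_with_med75: patients that reached visit_med75
site_n_ae <- 0          # total AE count of those patients at visit_med75
study_mean_ae <- 0.699  # mean_ae_study_med75: expected AEs per patient

poisson.test(
  x = site_n_ae,         # observed events
  T = site_n_pat,        # "time base": number of patients
  r = study_mean_ae,     # hypothesized event rate per patient under H0
  alternative = "less"   # we are only interested in under-reporting
)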
sim_studies() reproduces each study a hundred times using bootstrap resampling, maintaining the number of patients and the visit_med75 of each site.
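Conceptually, one repeat keeps each site's visit_med75 and patient count but fills the site with patients drawn with replacement from the whole study. Here is a minimal sketch for a single site; resample_site_once is a hypothetical helper, not part of simaerep.

library(dplyr)

# Hypothetical helper illustrating one bootstrap repeat for one site:
# draw the site's number of patients with replacement from the study
# pool and recompute the mean AE count at the site's visit_med75.
resample_site_once <- function(df_visit_study, visit_med75_site, n_pat_site) {
  df_visit_study %>%
    filter(visit == visit_med75_site) %>%              # AE counts at visit_med75
    slice_sample(n = n_pat_site, replace = TRUE) %>%   # resample patients
    summarise(mean_ae = mean(n_ae)) %>%
    pull(mean_ae)
}

# e.g. one resampled mean AE for a site with 8 patients and a visit_med75 of 16
df_study <- filter(df_visit, study_id == "ae_per_visit: 0.05")
resample_site_once(df_study, visit_med75_site = 16, n_pat_site = 8)

sim_studies() applies this resampling to every site of every study, recalculates pval and prob_low for each repeat, and can run the repeats in parallel: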
df_sim_studies <- sim_studies(
  df_visit,
  df_site,
  r = 100,
  parallel = TRUE,
  poisson_test = TRUE,
  prob_lower = TRUE,
  .progress = FALSE
)
df_sim_studies
## # A tibble: 30,000 × 8
## r study_id site_number visit_me…¹ n_pat…² n_pat…³ pval prob_…⁴
## <dbl> <chr> <chr> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 ae_per_visit: 0.05 S0001 16 8 860 0.389 0.208
## 2 1 ae_per_visit: 0.05 S0002 16 9 859 0.841 0.362
## 3 1 ae_per_visit: 0.05 S0003 15 10 898 1 1
## 4 1 ae_per_visit: 0.05 S0004 16 9 859 0.672 0.363
## 5 1 ae_per_visit: 0.05 S0005 18 9 704 1 1
## 6 1 ae_per_visit: 0.05 S0006 17 8 787 0.543 0.26
## 7 1 ae_per_visit: 0.05 S0007 16 10 858 1 0.48
## 8 1 ae_per_visit: 0.05 S0008 15 10 898 0.105 0.047
## 9 1 ae_per_visit: 0.05 S0009 17 9 786 1 0.562
## 10 1 ae_per_visit: 0.05 S0010 16 10 858 1 0.576
## # … with 29,990 more rows, and abbreviated variable names ¹visit_med75,
## # ²n_pat_with_med75, ³n_pat_study, ⁴prob_low
get_ecd_values() uses the p-value distribution in the data frame returned by sim_studies() to train an empirical cumulative distribution function (ECDF) for each study. The ECDF is then evaluated at the poisson.test() p-values returned by sim_sites(), giving the probability of obtaining such a p-value or lower in the resampled studies.
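A minimal sketch of this calibration for the p-values, using stats::ecdf() and the column names shown above (get_ecd_values() itself may differ in detail):

library(dplyr)
library(purrr)

# Sketch of the ECDF calibration: one ECDF per study, trained on the
# bootstrapped p-values and evaluated at the observed p-values.
df_ecdf <- df_sim_studies %>%
  group_by(study_id) %>%
  summarise(ecdf_pval = list(ecdf(pval)), .groups = "drop")

df_pval_ecd_sketch <- df_sim_sites %>%
  left_join(df_ecdf, by = "study_id") %>%
  mutate(pval_ecd = map2_dbl(ecdf_pval, pval, ~ .x(.y))) %>%
  select(-ecdf_pval)

pval_ecd is then the fraction of bootstrapped sites with a p-value at most as small as the observed one.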
Note: Dots on the edge of the graph that are cut off have zero value for the corresponding axis.
df_check_pval <- get_ecd_values(df_sim_studies, df_sim_sites, val_str = "pval")
df_check_pval
## # A tibble: 300 × 11
## study…¹ site_…² n_pat n_pat…³ visit…⁴ mean_…⁵ mean_…⁶ n_pat…⁷ pval prob_…⁸
## <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 ae_per… S0001 10 8 16 0 0.699 860 0.00834 0.004
## 2 ae_per… S0002 10 9 16 0.333 0.696 859 0.307 0.132
## 3 ae_per… S0003 10 10 15 0.3 0.643 898 0.230 0.106
## 4 ae_per… S0004 10 9 16 0.111 0.698 859 0.0252 0.016
## 5 ae_per… S0005 10 9 18 0.556 0.791 704 0.570 0.28
## 6 ae_per… S0006 10 8 17 0 0.742 787 0.00559 0.007
## 7 ae_per… S0007 10 10 16 0.6 0.693 858 1 0.44
## 8 ae_per… S0008 10 10 15 0.5 0.640 898 0.840 0.358
## 9 ae_per… S0009 10 9 17 0.111 0.742 786 0.0178 0.013
## 10 ae_per… S0010 10 10 16 0.5 0.695 858 0.699 0.304
## # … with 290 more rows, 1 more variable: pval_ecd <dbl>, and abbreviated
## # variable names ¹study_id, ²site_number, ³n_pat_with_med75, ⁴visit_med75,
## # ⁵mean_ae_site_med75, ⁶mean_ae_study_med75, ⁷n_pat_with_med75_study,
## # ⁸prob_low
df_check_pval %>%
  ggplot(aes(log(pval, base = 10), log(pval_ecd, base = 10))) +
  geom_point(alpha = 0.5, size = 2) +
  facet_wrap(~ study_id) +
  geom_abline(slope = 1, linetype = 2) +
  coord_cartesian(xlim = c(-5, 0), ylim = c(-5, 0)) +
  theme(aspect.ratio = 1)
We can see that the p-values from poisson.test() (x-axis) more or less match the probabilities with which they are represented in the resampled studies (y-axis), but there is some skewness to their relationship.
Note: Dots on the edge of the graph that are cut off have zero value for the corresponding axis. For the site simulations using 1000 repeats, the smallest value greater than zero we get for prob_low is 0.001 (1e-3).
df_check_prob <- get_ecd_values(df_sim_studies, df_sim_sites, val_str = "prob_low")
df_check_prob
## # A tibble: 300 × 11
## study…¹ site_…² n_pat n_pat…³ visit…⁴ mean_…⁵ mean_…⁶ n_pat…⁷ pval prob_…⁸
## <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 ae_per… S0001 10 8 16 0 0.699 860 0.00834 0.004
## 2 ae_per… S0002 10 9 16 0.333 0.696 859 0.307 0.132
## 3 ae_per… S0003 10 10 15 0.3 0.643 898 0.230 0.106
## 4 ae_per… S0004 10 9 16 0.111 0.698 859 0.0252 0.016
## 5 ae_per… S0005 10 9 18 0.556 0.791 704 0.570 0.28
## 6 ae_per… S0006 10 8 17 0 0.742 787 0.00559 0.007
## 7 ae_per… S0007 10 10 16 0.6 0.693 858 1 0.44
## 8 ae_per… S0008 10 10 15 0.5 0.640 898 0.840 0.358
## 9 ae_per… S0009 10 9 17 0.111 0.742 786 0.0178 0.013
## 10 ae_per… S0010 10 10 16 0.5 0.695 858 0.699 0.304
## # … with 290 more rows, 1 more variable: prob_low_ecd <dbl>, and abbreviated
## # variable names ¹study_id, ²site_number, ³n_pat_with_med75, ⁴visit_med75,
## # ⁵mean_ae_site_med75, ⁶mean_ae_study_med75, ⁷n_pat_with_med75_study,
## # ⁸prob_low
df_check_prob %>%
  ggplot(aes(log(prob_low, base = 10), log(prob_low_ecd, base = 10))) +
  geom_point(alpha = 0.5, size = 2) +
  facet_wrap(~ study_id) +
  geom_abline(slope = 1, linetype = 2) +
  coord_cartesian(xlim = c(-5, 0), ylim = c(-5, 0)) +
  theme(aspect.ratio = 1)
Here we see that the probabilities from the site simulations (x-axis) perfectly match the probabilities with which they are represented in the resampled studies (y-axis).
The bootstrap method thus gives us more accurate probabilities than the p-values derived from poisson.test(), even though AE generation in the sample data follows a strict Poisson process. For real clinical trial data, covering different types of studies for which it is not clear whether AE generation is truly driven by a pure Poisson process, we recommend using the bootstrap method. If poisson.test() is to be used, we recommend checking its applicability on the clinical data set at hand, as demonstrated here.