Binary Event Over-Reporting - Subject Discontinuation • simaerep

Load

suppressPackageStartupMessages(library(tibble))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(simaerep))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(clindata))

rawplus_studcomp <- clindata::rawplus_studcomp
rawplus_visdt <- clindata::rawplus_visdt

Introduction

This vignette will explore the process and viability of using the {simaerep} algorithm to detect patient discontinuities and flag sites where there are more discontinuations than expected.

Patient Discontinuations

A discontinuation occurs any time a patient leaves a clinical trial for any reason. It is important to minimize patient discontinuation in order to maintain patient participation and therefore robust data collection in clinical studies.

Sample Data

The sample data is from the open source r package {clindata}. Sampled data is from data frames rawplus_visdt and rawplus_studcomp.

Implementation

Cumulative vs Binary events

The main difference between detecting cumulative clinical events such as AEs and and binary events such as discontinuations is that with cumulative events there can be multiple occurrences per patient, but with binary events there can only be one event per patient. This difference needs to be addressed and the data must be formulated into a version that {simaerep} can process, otherwise the algorithm’s output will not make sense. The other major difference is that AE detection is focused on determining sites below the norm (under-reporting), but with discontinuation detection we want to look for sites that are above the norm (over-reporting).

Data Preparation

rawplus_studycomp contains one entry for each patient that has discontinued. We interpret column mincreated_dts to be the timestamp the discontinuation has been entered and therefore to be a good proxy for the actual discontinuation date.

rawplus_studcomp

## # A tibble: 133 × 31
##    studyid        siteid invid scrnid subjid subjectid   datapagename datapageid
##    <chr>          <chr>  <chr> <chr>  <chr>  <chr>       <chr>        <chr>     
##  1 AA-AA-000-0000 58     0X091 020    1236   X0911236-0… Study Compl… 1         
##  2 AA-AA-000-0000 128    0X149 123    1023   X1491023-1… Study Compl… 1         
##  3 AA-AA-000-0000 155    0X125 058    1346   X1251346-0… Study Compl… 1         
##  4 AA-AA-000-0000 43     0X159 113    0760   X1590760-1… Study Compl… 1         
##  5 AA-AA-000-0000 127    0X043 058    0854   X0430854-0… Study Compl… 1         
##  6 AA-AA-000-0000 71     0X083 142    0561   X0830561-1… Study Compl… 1         
##  7 AA-AA-000-0000 140    0X161 091    0290   X1610290-0… Study Compl… 1         
##  8 AA-AA-000-0000 53     0X015 142    1127   X0151127-1… Study Compl… 1         
##  9 AA-AA-000-0000 71     0X083 020    1152   X0831152-0… Study Compl… 1         
## 10 AA-AA-000-0000 184    0X123 123    0720   X1230720-1… Study Compl… 1         
## # ℹ 123 more rows
## # ℹ 23 more variables: foldername <chr>, instancename <chr>, recordid <chr>,
## #   record_dt <chr>, recordposition <dbl>, mincreated_dts <chr>,
## #   maxupdated_dts <chr>, compyn <chr>, compreas <chr>, subjid_nsv <chr>,
## #   scrnid_nsv <chr>, subjinit_nsv <chr>, invid_nsv <chr>, subject_nsv <chr>,
## #   instanceid_nsv <dbl>, folder_nsv <chr>, folderseq_nsv <dbl>,
## #   compyn_std_nsv <chr>, compreas_std_nsv <chr>, compfu_nsv <chr>, …

df_disc <- rawplus_studcomp %>%
  rename(date = mincreated_dts) %>%
  mutate(event = "disc") %>%
  distinct(studyid, siteid, subjid, date, event)

df_disc %>%
  head() %>%
  knitr::kable()

studyid	siteid	subjid	date	event
AA-AA-000-0000	58	1236	2009-07-08T05:50:04	disc
AA-AA-000-0000	128	1023	2016-04-23T01:48:05	disc
AA-AA-000-0000	155	1346	2018-10-23T05:37:28	disc
AA-AA-000-0000	43	0760	2004-10-14T06:54:46	disc
AA-AA-000-0000	127	0854	2011-09-30T11:51:19	disc
AA-AA-000-0000	71	0561	2015-12-15T02:17:47	disc

We continue to build another event table for visits and horizontally bind both event tables

df_vs <- rawplus_visdt %>%
  rename(date = visit_dt) %>%
  mutate(event = "visit") %>%
  # We ignore visits that have no date
  filter(! is.na(date)) %>%
  # We are not interested in same day visits
  distinct(studyid, siteid, subjid, date, event)

df_events <- bind_rows(df_vs, df_disc) %>%
  arrange(studyid, siteid, subjid, date)

df_events %>%
  filter(subjid == "0002") %>%
  knitr::kable()

studyid	siteid	subjid	date	event
AA-AA-000-0000	76	0002	2017-04-03	visit
AA-AA-000-0000	76	0002	2017-04-10	visit
AA-AA-000-0000	76	0002	2017-04-18	visit
AA-AA-000-0000	76	0002	2017-05-05	visit
AA-AA-000-0000	76	0002	2017-05-22	visit
AA-AA-000-0000	76	0002	2017-06-05T19:31:19	disc

We add the cumulative event counts and aggregate on visit level to prepare data in a format that is ready to use by {simaerep}.

df_visit <- df_events %>%
  mutate(
    cum_visit = cumsum(ifelse(event == "visit", 1, 0)),
    cum_disc = cumsum(ifelse(event == "disc", 1, 0)),
    .by = c("studyid", "siteid", "subjid")
  ) %>%
  # aggregate counts on visit level
  summarise(
    cum_disc = max(cum_disc),
    .by = c("studyid", "siteid", "subjid", "cum_visit")
  ) %>%
  # remove patients with 0 visits
  filter(max(cum_visit) > 0 , .by = c("studyid", "siteid", "subjid"))

df_visit %>%
  filter(subjid == "0002") %>%
  knitr::kable()

studyid	siteid	subjid	cum_visit	cum_disc
AA-AA-000-0000	76	0002	1	0
AA-AA-000-0000	76	0002	2	0
AA-AA-000-0000	76	0002	3	0
AA-AA-000-0000	76	0002	4	0
AA-AA-000-0000	76	0002	5	1

Sampling Correction

Notice that patient 0002 only has 5 entries, due to their early departure. The limited records of discontinued patients will create a problem with {simaerep}’s sampling algorithm, as the algorithm samples patients that have at least the same number of visits as the patient that is being replaced. This leads to an survivor bias where discontinued patients, since they have less visits, are unable to be used as replacement values for patients who had more visits.

To address this we would have to use the planned visits instead of the occurred visits, but {clindata} does not provide those.

As an approximation to the planned visits the discontinued patient’s records will be artificially inflated to 15 visits, a cut off point that includes roughly 80% of the patients not discontinued have reached during the study. This change allows for the proper sampling to occur and for the production of correct results.

subj_disc <- df_visit %>%
  filter(cum_disc == 1) %>%
  pull(subjid) %>%
  unique()

df_fill <- df_visit %>%
  distinct(studyid, siteid, subjid) %>%
  filter(subjid %in% subj_disc) %>%
  cross_join(
    tibble(
      cum_visit = seq(1, 15),
      disc_fill = 1
    )
  )

df_visit_disc <- df_visit %>%
  filter(subjid %in% subj_disc) %>%
  full_join(
    df_fill,
    by = c("studyid", "siteid" ,"subjid" ,"cum_visit")
  ) %>%
  mutate(
    cum_disc = coalesce(cum_disc, disc_fill)
  ) %>%
  select(- disc_fill) %>%
  arrange(subjid, cum_visit)

df_visit_not_disc <- df_visit %>%
  filter(! subjid %in% subj_disc)

df_visit_fill <- bind_rows(df_visit_disc, df_visit_not_disc) %>%
  rename(n_disc = cum_disc)

# Displays patient 0002, modified to 15 visits
df_visit_fill %>%
  filter(subjid == "0002") %>%
  arrange(cum_visit) %>%
  kable()

studyid	siteid	subjid	cum_visit	n_disc
AA-AA-000-0000	76	0002	1	0
AA-AA-000-0000	76	0002	2	0
AA-AA-000-0000	76	0002	3	0
AA-AA-000-0000	76	0002	4	0
AA-AA-000-0000	76	0002	5	1
AA-AA-000-0000	76	0002	6	1
AA-AA-000-0000	76	0002	7	1
AA-AA-000-0000	76	0002	8	1
AA-AA-000-0000	76	0002	9	1
AA-AA-000-0000	76	0002	10	1
AA-AA-000-0000	76	0002	11	1
AA-AA-000-0000	76	0002	12	1
AA-AA-000-0000	76	0002	13	1
AA-AA-000-0000	76	0002	14	1
AA-AA-000-0000	76	0002	15	1

{simaerep}

Since discontinuities are inherently more rare than AEs, it is necessary to run {simaerep} with a larger bootstrap iteration so that the results in between runs are more stable. The following code is run with 50,000 bootstrap repetitions, which is much higher than {simaereps}’s default of 1,000. This change allows the algorithm to provide a more stable and therefore more accurate model.

Here is the output of a {simaerep} call on df_visit modified for discontinuities.

discrep <- simaerep(
    df_visit_fill,
    r = 50000,
    event_names = "disc",
    col_names = list(
      study_id = "studyid",
      site_id = "siteid",
      patient_id = "subjid",
      visit = "cum_visit"
    )
  )

discrep

## simaerep object:
## ----------------
## Plot results using plot() generic.
## Full results available in "df_eval".
## 
## Summary:
## Number of sites: 176
## Number of studies: 1
## 
## Multiplicity correction applied to '*_prob' columns.
## 
## First 10 rows of df_eval:
## # A tibble: 10 × 10
##    studyid        siteid disc_count disc_per_visit_site visits n_pat
##    <chr>          <chr>       <dbl>               <dbl>  <dbl> <int>
##  1 AA-AA-000-0000 10              2             0.00494    405    20
##  2 AA-AA-000-0000 100             0             0           41     2
##  3 AA-AA-000-0000 101             0             0           63     3
##  4 AA-AA-000-0000 102             0             0           68     3
##  5 AA-AA-000-0000 103             0             0           86     4
##  6 AA-AA-000-0000 104             2             0.0106     188     9
##  7 AA-AA-000-0000 105             0             0           65     3
##  8 AA-AA-000-0000 106             0             0           69     3
##  9 AA-AA-000-0000 107             0             0           63     3
## 10 AA-AA-000-0000 109             0             0           89     4
## # ℹ 4 more variables: disc_per_visit_study <dbl>, disc_prob_no_mult <dbl>,
## #   disc_prob <dbl>, disc_delta <dbl>

plot(discrep)

## study = NULL, defaulting to study:AA-AA-000-0000

Top 100 out of 176 sites by discontinuation probability without multiplicity correction in descending order.

discrep$df_eval %>%
  mutate(disc_ratio_per_pat = disc_count / n_pat) %>%
  select(
    siteid,
    disc_count,
    visits,
    n_pat,
    disc_ratio_per_pat,
    disc_per_visit_site,
    disc_per_visit_study,
    disc_prob_no_mult,
    disc_prob,
    disc_delta
  ) %>%
  arrange(desc(disc_prob_no_mult)) %>%
  mutate(rank = row_number(), .before = siteid) %>%
  head(100) %>%
  knitr::kable(digits = 3)

rank	siteid	disc_count	visits	n_pat	disc_ratio_per_pat	disc_per_visit_site	disc_per_visit_study	disc_prob_no_mult	disc_prob	disc_delta
1	28	4	99	5	0.800	0.040	0.003	1.000	0.993	3.746
2	166	6	470	20	0.300	0.013	0.002	1.000	0.986	5.141
3	60	3	78	4	0.750	0.038	0.002	1.000	0.986	2.817
4	161	3	89	5	0.600	0.034	0.002	0.999	0.975	2.787
5	26	3	143	7	0.429	0.021	0.003	0.996	0.879	2.641
6	13	3	158	8	0.375	0.019	0.002	0.996	0.879	2.639
7	115	2	54	3	0.667	0.037	0.002	0.995	0.874	1.868
8	77	4	365	17	0.235	0.011	0.002	0.992	0.823	3.172
9	52	2	53	3	0.667	0.038	0.003	0.991	0.821	1.833
10	15	2	108	5	0.400	0.019	0.002	0.985	0.757	1.799
11	43	5	719	33	0.152	0.007	0.002	0.985	0.757	3.536
12	165	1	16	1	1.000	0.062	0.001	0.977	0.715	0.977
13	154	1	16	1	1.000	0.062	0.001	0.977	0.715	0.977
14	89	2	127	6	0.333	0.016	0.002	0.974	0.715	1.735
15	114	2	161	7	0.286	0.012	0.002	0.973	0.715	1.734
16	174	2	124	6	0.333	0.016	0.002	0.973	0.715	1.720
17	159	1	17	1	1.000	0.059	0.002	0.971	0.715	0.971
18	48	1	18	1	1.000	0.056	0.002	0.970	0.715	0.970
19	170	4	626	27	0.148	0.006	0.002	0.969	0.715	2.790
20	175	1	20	1	1.000	0.050	0.002	0.968	0.715	0.968
21	190	2	154	7	0.286	0.013	0.002	0.962	0.681	1.672
22	34	2	172	8	0.250	0.012	0.002	0.951	0.620	1.633
23	104	2	188	9	0.222	0.011	0.002	0.948	0.620	1.620
24	71	2	217	10	0.200	0.009	0.002	0.948	0.620	1.620
25	67	2	223	11	0.182	0.009	0.002	0.944	0.608	1.610
26	145	1	15	1	1.000	0.067	0.004	0.941	0.600	0.941
27	50	2	192	8	0.250	0.010	0.002	0.936	0.581	1.570
28	46	1	36	2	0.500	0.028	0.003	0.910	0.453	0.909
29	182	1	37	2	0.500	0.027	0.003	0.902	0.453	0.899
30	185	1	48	2	0.500	0.021	0.002	0.889	0.453	0.885
31	3	1	39	2	0.500	0.026	0.003	0.888	0.453	0.885
32	130	1	63	3	0.333	0.016	0.002	0.885	0.453	0.880
33	116	1	57	3	0.333	0.018	0.002	0.881	0.453	0.877
34	16	1	56	3	0.333	0.018	0.002	0.881	0.453	0.876
35	140	5	1456	69	0.072	0.003	0.002	0.881	0.453	2.382
36	179	1	106	4	0.250	0.009	0.001	0.879	0.453	0.874
37	64	1	82	4	0.250	0.012	0.002	0.879	0.453	0.874
38	53	1	65	3	0.333	0.015	0.002	0.877	0.453	0.872
39	144	1	80	4	0.250	0.013	0.002	0.875	0.453	0.870
40	173	3	602	27	0.111	0.005	0.002	0.873	0.453	1.752
41	40	1	57	3	0.333	0.018	0.002	0.873	0.453	0.867
42	96	1	82	4	0.250	0.012	0.002	0.869	0.452	0.862
43	62	3	711	34	0.088	0.004	0.002	0.865	0.450	1.727
44	91	2	366	16	0.125	0.005	0.002	0.859	0.450	1.336
45	168	1	59	3	0.333	0.017	0.003	0.859	0.450	0.852
46	81	1	78	4	0.250	0.013	0.002	0.856	0.450	0.849
47	124	1	85	4	0.250	0.012	0.002	0.849	0.434	0.840
48	163	1	72	3	0.333	0.014	0.002	0.839	0.412	0.831
49	10	2	405	20	0.100	0.005	0.002	0.836	0.412	1.277
50	127	1	102	5	0.200	0.010	0.002	0.832	0.409	0.819
51	25	1	66	3	0.333	0.015	0.003	0.828	0.408	0.818
52	135	1	109	5	0.200	0.009	0.002	0.823	0.407	0.809
53	128	1	95	5	0.200	0.011	0.002	0.822	0.407	0.808
54	39	1	92	4	0.250	0.011	0.002	0.814	0.393	0.799
55	131	1	141	7	0.143	0.007	0.001	0.808	0.387	0.790
56	86	2	477	22	0.091	0.004	0.002	0.798	0.366	1.169
57	184	1	104	5	0.200	0.010	0.002	0.788	0.345	0.767
58	58	2	459	20	0.100	0.004	0.002	0.780	0.337	1.119
59	29	1	142	7	0.143	0.007	0.002	0.778	0.337	0.752
60	134	1	130	6	0.167	0.008	0.002	0.773	0.333	0.748
61	186	1	119	5	0.200	0.008	0.002	0.751	0.281	0.722
62	157	1	155	7	0.143	0.006	0.002	0.733	0.252	0.698
63	83	1	142	7	0.143	0.007	0.002	0.732	0.252	0.696
64	59	1	180	8	0.125	0.006	0.002	0.714	0.213	0.670
65	4	1	196	9	0.111	0.005	0.002	0.668	0.102	0.606
66	155	1	230	10	0.100	0.004	0.002	0.648	0.060	0.576
67	51	1	237	11	0.091	0.004	0.002	0.615	0.000	0.524
68	94	2	768	36	0.056	0.003	0.002	0.608	0.000	0.655
69	69	1	271	12	0.083	0.004	0.002	0.574	0.000	0.459
70	63	1	240	10	0.100	0.004	0.002	0.570	0.000	0.455
71	92	1	312	15	0.067	0.003	0.002	0.561	0.000	0.432
72	76	1	273	12	0.083	0.004	0.002	0.547	0.000	0.414
73	167	1	321	15	0.067	0.003	0.002	0.536	0.000	0.393
74	61	1	346	15	0.067	0.003	0.002	0.460	0.000	0.245
75	54	1	390	17	0.059	0.003	0.002	0.436	0.000	0.189
76	172	1	601	28	0.036	0.002	0.002	0.343	0.000	-0.053
77	56	1	545	25	0.040	0.002	0.002	0.334	0.000	-0.071
78	143	1	601	28	0.036	0.002	0.002	0.318	0.000	-0.120
79	132	0	34	1	0.000	0.000	0.000	0.000	0.000	0.000
80	171	0	33	1	0.000	0.000	0.000	0.000	0.000	0.000
81	126	0	21	1	0.000	0.000	0.001	-0.031	0.000	-0.031
82	14	0	21	1	0.000	0.000	0.001	-0.031	0.000	-0.031
83	42	0	21	1	0.000	0.000	0.001	-0.031	0.000	-0.031
84	36	0	21	1	0.000	0.000	0.001	-0.031	0.000	-0.031
85	33	0	21	1	0.000	0.000	0.002	-0.032	0.000	-0.032
86	158	0	21	1	0.000	0.000	0.002	-0.032	0.000	-0.032
87	45	0	21	1	0.000	0.000	0.002	-0.032	0.000	-0.032
88	99	0	20	1	0.000	0.000	0.002	-0.032	0.000	-0.032
89	180	0	21	1	0.000	0.000	0.002	-0.032	0.000	-0.032
90	38	0	21	1	0.000	0.000	0.002	-0.032	0.000	-0.032
91	20	0	20	1	0.000	0.000	0.002	-0.032	0.000	-0.032
92	110	0	21	1	0.000	0.000	0.002	-0.033	0.000	-0.033
93	6	0	19	1	0.000	0.000	0.002	-0.036	0.000	-0.036
94	78	0	22	1	0.000	0.000	0.002	-0.041	0.000	-0.041
95	55	0	22	1	0.000	0.000	0.002	-0.042	0.000	-0.042
96	87	0	22	1	0.000	0.000	0.002	-0.042	0.000	-0.042
97	66	0	22	1	0.000	0.000	0.002	-0.042	0.000	-0.042
98	123	0	22	1	0.000	0.000	0.002	-0.043	0.000	-0.043
99	65	0	22	1	0.000	0.000	0.002	-0.043	0.000	-0.043
100	117	0	23	1	0.000	0.000	0.002	-0.045	0.000	-0.045

We find 4 sites that have a high reporting probability (>= 99% w/o multiplicity correction and >= 95% with) for discontinuations. As is expected the discontinuation ratio per patient which is a common monitoring metric is a poor indicator for which sites have high over-reporting probability as ratios from small samples tend to have more variability and are more likely to have outlier.

Summary

In order to use {simaerep} for binary events such as patient discontinuation we have to account for survival bias by using planned visits rather than actual visits. Then {simaerep} can produce very plausible over-reporting probabilities with the same advantages we expect {simaerep} to have over parametric statistical methods for other event reporting metrics.