The simdata dataset
Matthias Kormaksson, Kostas Sechidis
April 13, 2026
Introduction
The knockofftools package comes with a simulated data set called
simdata. This short vignette demonstrates how the data set
was generated.
Data generation with generate_simdata
The function generate_simdata was used to generate the
data set simdata that comes with the R-package.
generate_simdata <- function() {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(56969)
  N <- 2000
  p <- 30
  p_b <- 10
  p_nn <- 10
  # Generate a 2000 x 30 Gaussian data.frame under an equi-correlation (rho = 0.5) structure,
  # with 10 of the columns dichotomized.
  X <- generate_X(n=N, p=p, p_b=p_b, cov_type = "cov_equi", rho=0.5)
  # Generate the linear predictor lp = X %*% beta, where the first 10 beta-coefficients equal a and all others are 0.
  lp <- generate_lp(X=X, p_nn=p_nn, a=1)
  # Simulate a Gaussian response with mean = lp and sd = 1.
  Yg <- lp + rnorm(N)
  # Simulate a Bernoulli response with mean = exp(lp)/(1+exp(lp)).
  Yb <- factor(rbinom(N, size=1, prob=exp(lp)/(1+exp(lp))))
  # Simulate censored survival times from a Cox regression with linear predictor lp.
  Tc <- simulWeib(N=N, lambda = 0.01, rho = 1, lp = lp)
  dat <- data.frame(Yg, Yb, Tc, X)
  return(dat)
}
This function simulates a toy data set with 30 covariates $X$, one continuous response $Y_g$, one binary response $Y_b$ and one set of censored survival times $T_c$. Let’s now go through the individual components of the function one by one.
Simulation of $X$: The generate_X function simulates the rows of an
$N \times p$ data frame $X$ independently from a multivariate Gaussian
distribution with mean $0$ and covariance matrix $\Sigma$, where $p_b$
randomly selected columns are then dichotomized with the indicator
function $1(x > 0)$.
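The two steps just described (Gaussian rows, then dichotomization) can be imitated in a few lines of base R. This is only a sketch under stated assumptions, not the package's implementation: the helper name sketch_X is hypothetical, the covariance is assumed to be equi-correlated with unit variances, and the cut point 0 used for dichotomization is an assumption.

```r
# Hypothetical stand-in for generate_X(); not the knockofftools implementation.
sketch_X <- function(n, p, p_b, rho) {
  # Equi-correlation matrix: 1 on the diagonal, rho elsewhere.
  Sigma <- matrix(rho, nrow = p, ncol = p)
  diag(Sigma) <- 1
  # Rows of Z are i.i.d. N(0, Sigma): Z = E %*% R with t(R) %*% R = Sigma.
  Z <- matrix(rnorm(n * p), nrow = n) %*% chol(Sigma)
  X <- as.data.frame(Z)
  names(X) <- paste0("X", seq_len(p))
  # Dichotomize p_b randomly chosen columns (assumed cut point: 0).
  for (j in sample.int(p, p_b)) {
    X[[j]] <- factor(as.integer(X[[j]] > 0), levels = c(0, 1))
  }
  X
}

set.seed(1)
X_demo <- sketch_X(n = 200, p = 6, p_b = 2, rho = 0.5)
table(sapply(X_demo, class))  # counts of "factor" and "numeric" columns
```

The sketch returns a data frame with the same mix of numeric and factor columns that the vignette describes, which is the form the later steps rely on.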
X <- generate_X(n=N, p=p, p_b=p_b, cov_type = "cov_equi", rho=0.5)
The covariance type is specified with the parameter
cov_type and the correlation coefficient with
rho. Each column of the resulting data.frame is either of
class "numeric" (for the continuous columns) or
"factor" (for the binary columns).
Calculation of linear predictor: The
generate_lp function calculates the linear predictor
under sparsity, where the first p_nn regression
coefficients are non-zero and all others are set to zero. The (common)
amplitude of the non-zero regression coefficients is specified with
a. Here we generate $lp = X\beta$, which implies association with the
first 10 covariates, each with amplitude $a = 1$.
lp <- generate_lp(X=X, p_nn=p_nn, a=1)
Note that inside generate_lp the model.matrix of
X is first scaled.
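A sparse linear predictor on a scaled model matrix can be sketched as follows. The helper sketch_lp is hypothetical, and the specifics (dropping the intercept column, centring each column to mean 0 and scaling to standard deviation 1) are assumptions for illustration; the actual generate_lp may differ in these details.

```r
# Hypothetical stand-in for generate_lp(); details of the real function may differ.
sketch_lp <- function(X, p_nn, a) {
  # Expand factors to 0/1 dummy columns and drop the intercept column.
  M <- model.matrix(~ ., data = X)[, -1, drop = FALSE]
  # Standardize columns (assumption: centre to mean 0, scale to sd 1).
  M <- scale(M)
  # Sparse coefficient vector: first p_nn entries equal a, the rest are 0.
  beta <- c(rep(a, p_nn), rep(0, ncol(M) - p_nn))
  as.numeric(M %*% beta)
}

set.seed(2)
X_toy <- data.frame(X1 = rnorm(100),
                    X2 = factor(rbinom(100, 1, 0.5)),
                    X3 = rnorm(100))
lp_toy <- sketch_lp(X_toy, p_nn = 2, a = 1)
summary(lp_toy)
```

Scaling before multiplying by the coefficients puts continuous and binary columns on a comparable footing, so a common amplitude a has a similar meaning for both.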
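Simulation of Gaussian response: $Y_g$ is Gaussian with mean $lp$ and standard deviation $1$:
Yg <- lp + rnorm(N)
Simulation of Bernoulli response: $Y_b$ is Bernoulli with success probability $\exp(lp)/(1+\exp(lp))$:
Yb <- factor(rbinom(N, size=1, prob=exp(lp)/(1+exp(lp))))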
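The success probability exp(lp)/(1+exp(lp)) is the inverse logit of lp, which base R provides as plogis(). The following check, on made-up values of the linear predictor, illustrates that the two expressions agree.

```r
lp_demo <- c(-3, -0.5, 0, 0.5, 3)               # made-up linear predictor values
p_manual <- exp(lp_demo) / (1 + exp(lp_demo))   # formula used in generate_simdata
p_plogis <- plogis(lp_demo)                     # built-in inverse logit
all.equal(p_manual, p_plogis)
```

For very large lp, exp(lp) overflows to Inf and the manual formula returns NaN, whereas plogis() returns 1; for linear predictors of moderate size, as here, the two are interchangeable.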
Simulation of event and censoring times: The final command:
Tc <- simulWeib(N=N, lambda = 0.01, rho = 1, lp = lp)
generates censored survival times from a Cox regression model
with Weibull baseline hazard:
$$h(t \mid lp) = \lambda \rho t^{\rho - 1} \exp(lp),$$
where $\lambda$ and $\rho$ are scale and shape parameters, respectively.
The censoring times $C$ are randomly drawn from an exponential
distribution with a small (fixed) rate, which results in very mild
censoring. Once the event times $T$ and the censoring times $C$ have
been simulated, the function returns a survival object
(Surv) with time = min(T, C) and
event = 1{T < C}.
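Event times from a Cox model with a Weibull baseline can be drawn by inverse-transform sampling. The sketch below assumes the common parametrization with cumulative baseline hazard lambda * t^rho; the helper sketch_weib and its censoring rate rate_c = 0.001 are hypothetical choices for illustration and are not taken from simulWeib.

```r
library(survival)

# Hypothetical stand-in for simulWeib(); rate_c is an assumed censoring rate.
sketch_weib <- function(N, lambda, rho, lp, rate_c = 0.001) {
  # Inverse-transform sampling: with H0(t) = lambda * t^rho,
  # T = (-log(U) / (lambda * exp(lp)))^(1 / rho) for U ~ Uniform(0, 1).
  U <- runif(N)
  T_event <- (-log(U) / (lambda * exp(lp)))^(1 / rho)
  # Independent exponential censoring times.
  C <- rexp(N, rate = rate_c)
  # Observed time is the minimum; the event indicator is 1 when T < C.
  Surv(time = pmin(T_event, C), event = as.numeric(T_event < C))
}

set.seed(3)
Tc_demo <- sketch_weib(N = 500, lambda = 0.01, rho = 1, lp = rnorm(500))
mean(Tc_demo[, "status"])   # fraction of uncensored observations
```

With rho = 1, as in generate_simdata, the Weibull baseline reduces to a constant hazard, so the event times are exponential with rate lambda * exp(lp).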
The data set simdata
Now let’s have a look at the first few columns of the data set
simdata (the survival response Tc is a single Surv column that prints as Tc.time and Tc.status):
data(simdata)
head(simdata[,1:9])
#>          Yg Yb     Tc.time Tc.status X1          X2 X3          X4 X5 X6
#> 1 -8.616078  0 2961.163522  0.000000  0 -1.34958449  1 -2.06459293  0  0
#> 2  2.145781  1   14.818694  1.000000  0  1.17359743  1  0.84017766  1  0
#> 3  2.855925  1    5.662159  1.000000  0  0.06617883  1  0.56828582  1  0
#> 4 -7.393736  0  780.165789  0.000000  0  0.60006023  0 -0.88344294  0  0
#> 5 -5.357337  0 1290.810345  0.000000  0  0.68195033  0 -0.07230342  0  1
#> 6  3.720681  1    3.590458  1.000000  1 -0.65766424  1  0.46974365  0  1
and finally confirm that generate_simdata() indeed
reproduces the simdata dataset:
all.equal(simdata, generate_simdata())
#> [1] TRUE