
Introduction

In this vignette we demonstrate how to perform knockoff variable selection in the context of censored time-to-event data. We consider the following Cox regression model:

$$\lambda(t) = \lambda_0(t) \exp \left(X \beta + \tilde{X} \tilde{\beta} \right),$$

where $\lambda(t)$ is the hazard function corresponding to a censored time-to-event response $T$ and $\lambda_0(t)$ is the baseline hazard. The matrix $X$ is the design matrix corresponding to the covariates of interest, while $\tilde{X}$ represents a knockoff copy of $X$.

Knockoff variable selection in Cox regression involves the following steps:

  1. Simulate a knockoff copy $\tilde{X}$ of the original covariate data $X$.
  2. Compute feature statistics $W_j = |\beta_j| - |\tilde{\beta}_j|$ from the aggregated Cox regression with covariates $X$ and $\tilde{X}$. Large, positive statistics $W_j$ indicate an effect of $X_j$ on the hazard rate.
  3. For FDR control use the knockoff+ procedure to select the variables $j$ that fulfill $W_j \geq \tau_+$, where $$\tau_+ = \min \left\{t > 0 : \frac{1 + |\{j : W_j \leq -t\}|}{\max\left(1, |\{j : W_j \geq t\}|\right)} \leq q\right\}.$$ This workflow selects variables associated with the response with guaranteed control of the false discovery rate, $FDR \leq q$.
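To make step 3 concrete, here is a minimal base-R sketch of the knockoff+ threshold and selection rule (an illustration of the formula above, not the package's implementation; the example statistics are made up):

```r
# Knockoff+ selection from a single vector of feature statistics W at
# target FDR level q: find the smallest threshold tau_+ with
# (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= q,
# then select all variables with W_j >= tau_+.
knockoff_plus_select <- function(W, q) {
  ts <- sort(unique(abs(W[W != 0])))   # candidate thresholds t > 0
  ratio <- vapply(ts, function(t) {
    (1 + sum(W <= -t)) / max(1, sum(W >= t))
  }, numeric(1))
  ok <- ts[ratio <= q]
  if (length(ok) == 0) return(integer(0))  # no threshold attains level q
  which(W >= min(ok))
}

# Toy example: four clearly positive statistics, two near zero
W <- c(2.1, 1.8, 1.5, 1.2, -0.1, 0.05)
knockoff_plus_select(W, q = 0.25)   # selects the first four variables
```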

Data generation

In this section we will simulate a toy data set with a known truth. In particular, we will simulate a set of covariates $X$ and censored survival times that are associated with some of the $X$'s through a Cox regression model.

Simulation of $X$: The generate_X function simulates the rows of an $n \times p$ data frame $X$ independently from a multivariate Gaussian distribution with mean $0$ and $p \times p$ covariance matrix $$\Sigma_{ij} = \left \{ \begin{array}{lr} 1\{i = j\}, & \text{Independent,} \\ \rho^{1\{i \neq j\}}, & \text{Equicorrelated,} \\ \rho^{|i-j|}, & \text{AR1,} \end{array} \right.$$ where $p_b$ randomly selected columns are then dichotomized with the indicator function $\delta(x) = 1\{x > 0\}$.

# Generate a 2000 x 30 Gaussian data.frame under equi-correlation (rho=0.5) structure,
# with 10 of the columns dichotomized
set.seed(1)
X <- generate_X(n=2000, p=30, p_b=10, cov_type = "cov_equi", rho=0.5)

The covariance type is specified with the parameter cov_type and the correlation coefficient with rho. Each column of the resulting data.frame is either of class "numeric" (for the continuous columns) or "factor" (for the binary columns).

Simulation of event and censoring times: The function simulWeib simulates survival times $T$ from a Cox regression model

$$\lambda(t) = \lambda_0(t) \exp \left(X \beta \right),$$ with Weibull baseline hazard $$\lambda_0(t) = \lambda_0 \rho t^{\rho-1},$$ where $\lambda_0 > 0$ and $\rho > 0$ are scale and shape parameters, respectively. The censoring times $C$ are randomly drawn from an exponential distribution with a small (fixed) rate $\lambda_C = 0.0005$, which results in very mild censoring. Once $T$ and $C$ have been simulated the function sets time = min(T, C) and event = 1{T < C}:

simulWeib <- function(N, lambda0, rho, lp) {
    lambdaC <- 5e-04                                   # fixed censoring rate
    v <- runif(n = N)                                  # uniforms for inverse-transform sampling
    Tlat <- (-log(v)/(lambda0 * exp(lp)))^(1/rho)      # latent Weibull event times
    C <- rexp(n = N, rate = lambdaC)                   # exponential censoring times
    time <- pmin(Tlat, C)                              # observed time = min(T, C)
    status <- as.numeric(Tlat <= C)                    # event indicator 1{T < C}
    survival::Surv(time = time, event = status)
}
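The expression for Tlat above follows from inverse-transform sampling: under the Weibull baseline hazard the cumulative baseline hazard is $\Lambda_0(t) = \lambda_0 t^{\rho}$, so the survival function of $T$ given the linear predictor $\ell_p$ is

$$S(t) = \exp\left(-\lambda_0 t^{\rho} e^{\ell_p}\right).$$

Setting $S(T) = U$ with $U \sim \text{Uniform}(0,1)$ and solving for $T$ gives

$$T = \left(\frac{-\log U}{\lambda_0 e^{\ell_p}}\right)^{1/\rho},$$

which is exactly the Tlat line, with v playing the role of $U$.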

In this toy data set we assume that the first five covariates affect the hazard $\lambda(t)$ through the linear predictor $\ell_p = X\beta$ and then simulate the survival object:

lp <- generate_lp(X, p_nn=5, a=2)
set.seed(2)
y <- simulWeib(N=2000, lambda0 = 0.01, rho = 1, lp = lp)

Knockoff (feature) statistics in Cox regression

Let’s now use the knockoff.statistics function to (a) simulate $M$ independent knockoffs $\tilde{X}_k$, $k=1,\dots,M$, and (b) fit the aggregated Cox model to the above data: $$\lambda(t) = \lambda_0(t) \exp \left(X \beta^{(k)} + \tilde{X}_k \tilde{\beta}^{(k)} \right),$$ for each of the $k=1,\dots,M$ knockoff copies, and then finally (c) calculate, for each knockoff $k$, the corresponding knockoff (feature) statistics $W^{(k)}_j = |\beta_j^{(k)}| - |\tilde{\beta}_j^{(k)}|$.

set.seed(123)
W <- knockoff.statistics(y, X, type="survival", M=10)
head(W)
#>             W1          W2         W3        W4         W5          W6
#> X1  1.68417085  1.64920854 1.71294466 1.7130229 1.72623754  1.76959115
#> X2  1.78733106  1.79439991 1.82809127 1.7953936 1.80803144  1.83159085
#> X3  1.74840772  1.75951906 1.80838883 1.7588122 1.75238267  1.80715192
#> X4  1.72546124  1.67708322 1.78064956 1.7258980 1.74789299  1.73426786
#> X5  1.75046697  1.74950657 1.82076525 1.7650704 1.78235237  1.83577773
#> X6 -0.01921297 -0.01275992 0.01461379 0.0156508 0.01742504 -0.01154548
#>            W7         W8         W9          W10
#> X1 1.73428009 1.72061443 1.73521643  1.746217213
#> X2 1.80838108 1.77398820 1.80714829  1.804438638
#> X3 1.76401959 1.74515206 1.77405806  1.829114226
#> X4 1.72333722 1.71313170 1.72602257  1.779804692
#> X5 1.78468873 1.77687033 1.79819418  1.805816339
#> X6 0.00387067 0.01255798 0.01473674 -0.003842216

The output is a data.frame whose columns are the $M$ knockoff statistics. In the above we have used (as a default) parallel computing via the package clustermq (see the User Guide - clustermq).

Variable selection via multiple knockoffs

We simulated $M = 10$ feature statistics in order to evaluate the robustness of the knockoff variable selection process, each time simulating a new knockoff copy of X. To determine which variables are selected for each of these knockoff statistics we use the variable.selections function. This function takes the knockoff statistics W as input and additionally specifies error.type="fdr" (default), "pfer" (per-family error rate), or "kfwer" (k-familywise error rate) and the corresponding target error level.

S = variable.selections(W, level = 0.2, error.type="fdr")
head(S$selected)
#>    S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
#> X1  1  1  1  1  1  1  1  1  1   1
#> X2  1  1  1  1  1  1  1  1  1   1
#> X3  1  1  1  1  1  1  1  1  1   1
#> X4  1  1  1  1  1  1  1  1  1   1
#> X5  1  1  1  1  1  1  1  1  1   1
#> X6  0  0  0  1  0  0  0  0  0   0
S$stable.variables
#> [1] "X1" "X2" "X3" "X4" "X5"

In a nutshell, the function calculates both the individual variable selections (one per knockoff replicate) and a stable set of variables that are selected most frequently among the $M$ replicates. For FDR control the stable variables are calculated using the heuristics of the multiple knockoffs filter procedure (see Appendix A), while for the other two error types ("pfer" and "kfwer") we simply choose all variables selected more than thres*M times, where thres is an optional parameter of the variable.selections function (thres=0.5 by default).
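For "pfer" and "kfwer" the thres*M rule amounts to a simple majority vote, which can be sketched in base R (an illustration, not the package's internal code; the selection matrix is made up to mirror the output above):

```r
# A variable is "stable" if it is selected in more than thres * M
# of the M knockoff replicates (thres = 0.5 by default).
stable_by_vote <- function(selected, thres = 0.5) {
  rownames(selected)[rowSums(selected) > thres * ncol(selected)]
}

# Toy 3 x 10 selection matrix: X1 and X2 always selected, X6 once
sel <- matrix(c(rep(1, 20), c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0)),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("X1", "X2", "X6"), paste0("S", 1:10)))
stable_by_vote(sel)   # "X1" "X2"
```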

Heatmap of multiple variable selections

In order to evaluate the robustness of the knockoff selection procedure we can visualize a heatmap of the selections across the knockoff replicates.

plot(S)
#> Warning: Vectorized input to `element_text()` is not officially supported.
#>  Results may be unexpected or may change in future versions of ggplot2.

We apply co-clustering of the rows and columns of the heatmap, which identifies blocks of similarities in the binary selections. We then order the blocks according to increasing mean number of selections. This improves aesthetics and also helps visualize the most important variables, which tend toward the top of the heatmap.

Appendix

A. The multiple knockoffs filter procedure for FDR-control

Kormaksson et al. (2021) introduced a heuristic algorithm for selecting stable variables from multiple independent knockoff variable selections.

Let $\tilde{X}_1, \dots, \tilde{X}_B$ denote $B$ independent knockoff copies of $X$. For each knockoff copy $b$ run the knockoff filter and select the set of influential variables, $S_b \subseteq \{1,\dots,p\}$. We propose the following heuristics to select a final set of variables:

  • Let $F(r) \subseteq \{1,\dots,p\}$, where $r \in [0.5, 1]$, denote the set of variables selected more than $r \cdot B$ times out of the $B$ knockoff draws.
  • Let $S(r) = \underset{b}{\operatorname{mode}}\{F(r) \cap S_b\}$ denote the set of selected variables that appears most frequently, after filtering out variables that are not in $F(r)$.
  • Return $\hat{S} = S(\hat{r})$, where $\hat{r} = \underset{r \geq 0.5}{\operatorname{argmax}} |S(r)|$, i.e. the largest set among $\{S(r) : r \geq 0.5\}$.

The first step above essentially filters out variables that don't appear more than $(100 \cdot r)\%$ of the time, which seems like a reasonable requirement in practice (e.g. with $r = 0.5$). The second step then filters the $B$ knockoff selections $S_b$ accordingly and searches for the most frequent variable set among them. The third step establishes the final selection, namely the most liberal variable selection among the sets $\{S(r) : r \geq 0.5\}$.
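The three steps can be sketched in base R as follows (a minimal illustration of the heuristic, not the package's internal implementation; the example selections are made up):

```r
# `selections` is a list of the B selected variable sets S_b,
# one per knockoff copy.
stable_selection <- function(selections, r_grid = seq(0.5, 1, by = 0.05)) {
  B <- length(selections)
  counts <- table(unlist(selections))
  best <- numeric(0)
  for (r in r_grid) {
    # Step 1: variables selected more than r * B times
    Fr <- as.numeric(names(counts)[counts > r * B])
    # Step 2: filter each S_b and find the most frequent filtered set
    filtered <- lapply(selections, intersect, Fr)
    keys <- vapply(filtered, function(s) paste(sort(s), collapse = ","), character(1))
    tab <- table(keys)
    mode_key <- names(tab)[which.max(tab)]
    Sr <- filtered[[match(mode_key, keys)]]
    # Step 3: keep the largest S(r) over the grid of r values
    if (length(Sr) > length(best)) best <- Sr
  }
  sort(best)
}

# Toy example: five knockoff selections over variables 1, 2, 3, 6
sels <- list(c(1, 2, 3), c(1, 2, 3), c(1, 2, 3, 6), c(1, 2), c(1, 2, 3))
stable_selection(sels)   # 1 2 3
```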

References

Kormaksson, M., L. J. Kelly, X. Zhu, S. Haemmerle, L. Pricop, and D. Ohlssen. 2021. “Sequential Knockoffs for Continuous and Categorical Predictors: With Application to a Large Psoriatic Arthritis Clinical Trial Pool.” Statistics in Medicine.