Details of the Kenward-Roger calculations

Here we describe the details of the calculations for the Kenward-Roger degrees of freedom and the adjusted covariance matrix of the coefficients.

Model definition

The model definition is the same as what we have in Details of the model fitting in mmrm. We are using the same notations.

Linear model

For each subject $i$ we observe a vector \[ Y_i = (y_{i1}, \dotsc, y_{im_i})^\top \in \mathbb{R}^{m_i} \] and given a design matrix \[ X_i \in \mathbb{R}^{m_i \times p} \] and a corresponding coefficient vector $\beta \in \mathbb{R}^{p}$ we assume that the observations are multivariate normal distributed: \[ Y_i \sim N(X_i\beta, \Sigma_i) \] where the covariance matrix $\Sigma_i \in \mathbb{R}^{m_i \times m_i}$ is derived by subsetting the overall covariance matrix $\Sigma \in \mathbb{R}^{m \times m}$ appropriately by \[ \Sigma_i = G_i^{-1/2} S_i^\top \Sigma S_i G_i^{-1/2} \] where the subsetting matrix $S_i \in \{0, 1\}^{m \times m_i}$ contains in each of its $m_i$ columns contains a single 1 indicating which overall time point is matching $t_{ih}$. $G_i \in \mathbb{R}_{\gt 0}^{m_i \times m_i}$ is the diagonal weight matrix.

Conditional on the design matrices $X_i$, the coefficient vector $\beta$ and the covariance matrix $\Sigma$ we assume that the observations are independent between the subjects.

We can write the linear model for all subjects together as \[ Y = X\beta + \epsilon \] where $Y \in \mathbb{R}^N$ combines all subject specific observations vectors $Y_i$ such that we have in total $N = \sum_{i = 1}^{n}{m_i}$ observations, $X \in \mathbb{R}^{N \times p}$ combines all subject specific design matrices and $\epsilon \in \mathbb{R}^N$ has a multivariate normal distribution \[ \epsilon \sim N(0, \Omega) \] where $\Omega \in \mathbb{R}^{N \times N}$ is block-diagonal containing the subject specific $\Sigma_i$ covariance matrices on the diagonal and 0 in the remaining entries.

Mathematical Details of Kenward-Roger method

The mathematical derivation of the Kenward-Roger method is based on the Taylor expansion of the obtained covariance matrix of $\hat\beta$ to get a more accurate estimate for it. All these derivations are based on the restricted maximum likelihood. Following the same notation, the covariance matrix, $\Omega$ can be represented as a function of covariance matrix parameters $\theta = (\theta_1, \dotsc, \theta_k)^\top$, i.e. $\Omega(\theta)$. Here after model fitting with mmrm, we obtain the estimate $\hat\beta = \Phi(\hat\theta)X^\top\Omega(\hat\theta)^{-1}Y$, where $\Phi(\theta) = \left\{X^\top \Omega(\theta)^{-1} X\right\} ^{-1}$ is the asymptotic covariance matrix of $\hat\beta$. However, Kackar and Harville (1984) suggests that although the $\hat\beta$ is unbiased for $\beta$, the covariance matrix, $\hat\Phi = \left\{X^\top \hat\Omega X\right\}^{-1}$ can be biased. They showed that the variability of $\hat\beta$ can be partitioned into two components,

\[ \Phi_A = \Phi + \Lambda \]

where $\Phi$ is the variance-covariance matrix of the asymptotic distribution of $\hat\beta$ as $n\rightarrow \infty$ as defined above, and $\Lambda$ represents the amount to which the asymptotic variance-covariance matrix underestimates $\Phi_A$.

Based on a Taylor series expansion around $\theta$, $\Lambda$ can be approximated by

\[ \Lambda \simeq \Phi \left\{\sum_{h=1}^k{\sum_{j=1}^k{W_{hj}(Q_{hj} - P_h \Phi P_j)} }\right\} \Phi \] where \[ P_h = X^\top \frac{\partial{\Omega^{-1}}}{\partial \theta_h} X \] \[ Q_{hj} = X^\top \frac{\partial{\Omega^{-1}}}{\partial \theta_h} \Omega \frac{\partial{\Omega^{-1}}}{\partial \theta_j} X \]

$W$ is the inverse of the Hessian matrix of the log-likelihood function of $\theta$ evaluated at the estimate $\hat\theta$, i.e. the observed Fisher Information matrix as a consistent estimator of the variance-covariance matrix of $\hat\theta$.

Again, based on a Taylor series expansion about $\theta$, Kenward and Roger (1997) show that \[ \hat\Phi \simeq \Phi + \sum_{h=1}^k{(\hat\theta_h - \theta_h)\frac{\partial{\Phi}}{\partial{\theta_h}}} + \frac{1}{2} \sum_{h=1}^k{\sum_{j=1}^k{(\hat\theta_h - \theta_h)(\hat\theta_j - \theta_j)\frac{\partial^2{\Phi}}{\partial{\theta_h}\partial{\theta_j}}}} \] Ignoring the possible bias in $\hat\theta$, \[ E(\hat\Phi) \simeq \Phi + \frac{1}{2} \sum_{h=1}^k{\sum_{j=1}^k{W_{hj}\frac{\partial^2{\Phi}}{\partial{\theta_h}\partial{\theta_j}}}} \] Using previously defined notations, this can be further written as \[ \frac{\partial^2{\Phi}}{\partial{\theta_h}\partial{\theta_j}} = \Phi (P_h \Phi P_j + P_j \Phi P_h - Q_{hj} - Q_{jh} + R_{hj}) \Phi \] where \[ R_{hj} = X^\top\Omega^{-1}\frac{\partial^2\Omega}{\partial{\theta_h}\partial{\theta_j}} \Omega^{-1} X \]

substituting $\Phi$ and $\Lambda$ back to the $\hat\Phi_A$, we have

\[ \hat\Phi_A = \hat\Phi + 2\hat\Phi \left\{\sum_{h=1}^k{\sum_{j=1}^k{W_{hj}(Q_{hj} - P_h \hat\Phi P_j - \frac{1}{4}R_{hj})} }\right\} \hat\Phi \]

where $\Omega(\hat\theta)$ replaces $\Omega(\theta)$ in the right-hand side.

Please note that, if we ignore $R_{hj}$, the second-order derivatives, we will get a different estimate of adjusted covariance matrix, and we call this the linear Kenward-Roger approximation.

Special Considerations for mmrm models

In mmrm models, $\Omega$ is a block-diagonal matrix, hence we can calculate $P$, $Q$ and $R$ for each subject and add them up.

\[ P_h = \sum_{i=1}^{N}{P_{ih}} = \sum_{i=1}^{N}{X_i^\top \frac{\partial{\Sigma_i^{-1}}}{\partial \theta_h} X_i} \]

\[ Q_{hj} = \sum_{i=1}^{N}{Q_{ihj}} = \sum_{i=1}^{N}{X_i^\top \frac{\partial{\Sigma_i^{-1}}}{\partial \theta_h} \Sigma_i \frac{\partial{\Sigma_i^{-1}}}{\partial \theta_j} X_i} \]

\[ R_{hj} = \sum_{i=1}^{N}{R_{ihj}} = \sum_{i=1}^{N}{X_i^\top\Sigma_i^{-1}\frac{\partial^2\Sigma_i}{\partial{\theta_h}\partial{\theta_j}} \Sigma_i^{-1} X_i} \]

Derivative of the overall covariance matrix $\Sigma$

The derivative of the overall covariance matrix $\Sigma$ with respect to the variance parameters can be calculated through the derivatives of the Cholesky factor, and hence obtained through automatic differentiation, following matrix identities calculations. \[ \frac{\partial{\Sigma}}{\partial{\theta_h}} = \frac{\partial{LL^\top}}{\partial{\theta_h}} = \frac{\partial{L}}{\partial{\theta_h}}L^\top + L\frac{\partial{L^\top}}{\partial{\theta_h}} \]

\[ \frac{\partial^2{\Sigma}}{\partial{\theta_h}\partial{\theta_j}} = \frac{\partial^2{L}}{\partial{\theta_h}\partial{\theta_j}}L^\top + L\frac{\partial^2{L^\top}}{\partial{\theta_h}\partial{\theta_j}} + \frac{\partial{L}}{\partial{\theta_h}}\frac{\partial{L^T}}{\partial{\theta_j}} + \frac{\partial{L}}{\partial{\theta_j}}\frac{\partial{L^\top}}{\partial{\theta_h}} \]

Derivative of the $\Sigma^{-1}$

The derivatives of $\Sigma^{-1}$ can be calculated through

\[ \frac{\partial{\Sigma\Sigma^{-1}}}{\partial{\theta_h}}\\ = \frac{\partial{\Sigma}}{\partial{\theta_h}}\Sigma^{-1} + \Sigma\frac{\partial{\Sigma^{-1}}}{\partial{\theta_h}} \\ = 0 \] \[ \frac{\partial{\Sigma^{-1}}}{\partial{\theta_h}} = - \Sigma^{-1} \frac{\partial{\Sigma}}{\partial{\theta_h}}\Sigma^{-1} \]

Subjects with missed visits

If a subject do not have all visits, the corresponding covariance matrix can be represented as \[ \Sigma_i = S_i^\top \Sigma S_i \]

and the derivatives can be obtained through

\[ \frac{\partial{\Sigma_i}}{\partial{\theta_h}} = S_i^\top \frac{\partial{\Sigma}}{\partial{\theta_h}} S_i \]

\[ \frac{\partial^2{\Sigma_i}}{\partial{\theta_h}\partial{\theta_j}} = S_i^\top \frac{\partial^2{\Sigma}}{\partial{\theta_h}\partial{\theta_j}} S_i \]

The derivative of the $\Sigma_i^{-1}$, $\frac{\partial\Sigma_i^{-1}}{\partial{\theta_h}}$ can be calculated through $\Sigma_i^{-1}$ and $\frac{\partial{\Sigma_i}}{\partial{\theta_h}}$ using the above.

Scenario under group specific covariance estimates

When fitting grouped mmrm models, the covariance matrix for subject i of group $g(i)$, can be written as \[ \Sigma_i = S_i^\top \Sigma_{g(i)} S_i$. \] Assume there are $B$ groups, the number of parameters is increased by $B$ times. With the fact that for each group, the corresponding $\theta$ will not affect other parts, we will have block-diagonal $P$, $Q$ and $R$ matrices, where the blocks are given by:

\[ P_h = \begin{pmatrix} P_{h, 1} & \dots & P_{h, B} \\ \end{pmatrix} \]

\[ Q_{hj} = \begin{pmatrix} Q_{hj, 1} & 0 & \dots & \dots & 0 \\ 0 & Q_{hj, 2} & 0 & \dots & 0\\ \vdots & & \ddots & & \vdots \\ \vdots & & & \ddots & \vdots \\ 0 & \dots & \dots & 0 & Q_{hj, B} \end{pmatrix} \]

\[ R_{hj} = \begin{pmatrix} R_{hj, 1} & 0 & \dots & \dots & 0 \\ 0 & R_{hj, 2} & 0 & \dots & 0\\ \vdots & & \ddots & & \vdots \\ \vdots & & & \ddots & \vdots \\ 0 & \dots & \dots & 0 & R_{hj, B} \end{pmatrix} \]

Use $P_{j, b}$ to denote the block diagonal part for group $b$, we have \[ P_{h,b} = \sum_{g(i) = b}{P_{ih}} = \sum_{g(i) = b}{X_i^\top \frac{\partial{\Sigma_i^{-1}}}{\partial \theta_h} X_i} \]

\[ Q_{hj,b} = \sum_{g(i) = b}{Q_{ihj}} = \sum_{g(i) = b}{X_i^\top \frac{\partial{\Sigma_i^{-1}}}{\partial \theta_h} \Sigma_i \frac{\partial{\Sigma_i^{-1}}}{\partial \theta_j} X_i} \]

\[ R_{hj,b} = \sum_{g(i) = b}{R_{ihj}} = \sum_{g(i) = b}{X_i^\top\Sigma_i^{-1}\frac{\partial^2\Sigma_i}{\partial{\theta_h}\partial{\theta_j}} \Sigma_i^{-1} X_i} \]

Similarly for $R$.

Scenario under weighted mmrm

Under weights mmrm model, the covariance matrix for subject $i$, can be represented as

\[ \Sigma_i = G_i^{-1/2} S_i^\top \Sigma S_i G_i^{-1/2} \]

Where $G_i$ is a diagonal matrix of the weights. Then, when deriving $P$, $Q$ and $R$, there are no mathematical differences as they are constant, and having $G_i$ in addition to $S_i$ does not change the algorithms and we can simply multiply the formulas with $G_i^{-1/2}$, similarly as above for the subsetting matrix.

Inference

Suppose we are testing the linear combination of $\beta$, $C\beta$ with $C \in \mathbb{R}^{c\times p}$, we can use the following F-statistic \[ F = \frac{1}{c} (\hat\beta - \beta)^\top C (C^\top \hat\Phi_A C)^{-1} C^\top (\hat\beta - \beta) \] and \[ F^* = \lambda F \] follows exact $F_{c,m}$ distribution.

When we have only one coefficient to test, then $C$ is a column vector containing a single $1$ inside. We can still follow the same calculations as for the multi-dimensional case. This recovers the degrees of freedom results of the Satterthwaite method.

$\lambda$ and $m$ can be calculated through

\[ M = C (C^\top \Phi C)^{-1} C^\top \]

\[ A_1 = \sum_{h=1}^k{\sum_{j=1}^k{W_{hj} tr(M \Phi P_h \Phi) tr(M \Phi P_j \Phi)}} \]

\[ A_2 = \sum_{h=1}^k{\sum_{j=1}^k{W_{hj} tr(M \Phi P_h \Phi M \Phi P_j \Phi)}} \]

\[ B = \frac{1}{2c}(A_1 + 6A_2) \]

\[ g = \frac{(c+1)A_1 - (c+4)A_2}{(c+2)A_2} \]

\[ c_1 = \frac{g}{3c+2(1-g)} \]

\[ c_2 = \frac{c-g}{3c+2(1-g)} \]

\[ c_3 = \frac{c+2-g}{3c+2(1-g)} \] \[E^*={\left\{1-\frac{A_2}{c}\right\}}^{-1}\] \[V^*=\frac{2}{c}{\left\{\frac{1+c_1 B}{(1-c_2 B)^2(1-c_3 B)}\right\}}\]

\[\rho = \frac{V^{*}}{2(E^*)^2}\]

\[m = 4 + \frac{c+2}{c\rho - 1}\] \[\lambda = \frac{m}{E^*(m-2)}\]

Parameterization methods and Kenward-Roger

While the Kenward-Roger adjusted covariance matrix is adopting a Taylor series to approximate the true value, the choices of parameterization can change the result. In a simple example of unstructured covariance structure, in our current approach, where the parameters are elements of the Cholesky factor of $\Sigma$ (see parameterization), the second-order derivatives of $\Sigma$ over our parameters, are non-zero matrices. However, if we use the elements of $\Sigma$ as our parameters, then the second-order derivatives are zero matrices. However, the differences can be small and will not affect the inference. If you would like to match SAS results for the unstructured covariance model, you can use the linear Kenward-Roger approximation.

References

Kackar, Raghu N., and David A. Harville. 1984. “Approximations for Standard Errors of Estimators of Fixed and Random Effects in Mixed Linear Models.” Journal of the American Statistical Association 79 (388): 853–62. https://doi.org/10.1080/01621459.1984.10477102.

Kenward, Michael G., and James H. Roger. 1997. “Small Sample Inference for Fixed Effects from Restricted Maximum Likelihood.” Biometrics 53 (3): 983–97. https://doi.org/10.2307/2533558.