
Here we describe the exact model definition as well as the estimation algorithms in detail. After reading through this vignette, you can follow the implementation of the algorithm in mmrm.cpp and the covariance structures in covariance.h in the src directory of this package.

Model definition

The mixed model for repeated measures (MMRM) definition we are using in this package is the following. Let $i = 1, \dotsc, n$ denote the subjects from which we observe multiple observations $j = 1, \dotsc, m_i$ from total $m_i$ time points $t_{ij} \in \{t_1, \dotsc, t_m\}$. Note that the number of time points for a specific subject, $m_i$, can be smaller than $m$, when only a subset of the possible $m$ time points have been observed.

Linear model

For each subject $i$ we observe a vector
$$ Y_i = (y_{i1}, \dotsc, y_{im_i})^\top \in \mathbb{R}^{m_i} $$
and given a design matrix $X_i \in \mathbb{R}^{m_i \times p}$ and a corresponding coefficient vector $\beta \in \mathbb{R}^{p}$ we assume that the observations are multivariate normal distributed:
$$ Y_i \sim N(X_i\beta, \Sigma_i) $$
where the covariance matrix $\Sigma_i \in \mathbb{R}^{m_i \times m_i}$ is derived by subsetting the overall covariance matrix $\Sigma \in \mathbb{R}^{m \times m}$ appropriately by
$$ \Sigma_i = G_i^{-1/2} S_i^\top \Sigma S_i G_i^{-1/2} $$
where the subsetting matrix $S_i \in \{0, 1\}^{m \times m_i}$ contains in each of its $m_i$ columns a single 1 indicating which overall time point matches $t_{ij}$. Each row contains at most a single 1, but can also contain only 0 if this time point was not observed. For example, assume a subject was observed at time points $1, 3, 4$ out of in total $5$; then the subsetting matrix is
$$ S_i = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}. $$
$G_i \in \mathbb{R}_{> 0}^{m_i \times m_i}$ is the diagonal weight matrix, which is the identity matrix if no weights are specified. Note that this follows from the well known property of the multivariate normal distribution that linear combinations of the random vector again have a multivariate normal distribution with the correspondingly modified mean vector and covariance matrix.
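
To make the subsetting concrete, here is a small R sketch (illustrative only, not the package's C++ implementation) that builds the example $S_i$ above and derives $\Sigma_i$ from a given $\Sigma$ and diagonal weights; the objects `Sigma`, `weights` and `obs_times` are made up for this example.

```r
m <- 5
obs_times <- c(1, 3, 4)                      # time points observed for subject i
m_i <- length(obs_times)

# Subsetting matrix S_i: column j has a single 1 in row obs_times[j].
S_i <- matrix(0, nrow = m, ncol = m_i)
S_i[cbind(obs_times, seq_len(m_i))] <- 1

# Overall covariance matrix Sigma (any symmetric positive definite example).
Sigma <- 0.5 ^ abs(outer(1:m, 1:m, "-")) * 4

# Diagonal weight matrix G_i (identity if no weights are given).
weights <- c(1, 2, 1)
G_i_inv_sqrt <- diag(1 / sqrt(weights))

# Subject-specific covariance matrix as defined above.
Sigma_i <- G_i_inv_sqrt %*% t(S_i) %*% Sigma %*% S_i %*% G_i_inv_sqrt
Sigma_i
```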

Conditional on the design matrices $X_i$, the coefficient vector $\beta$ and the covariance matrix $\Sigma$ we assume that the observations are independent between the subjects.

We can write the linear model for all subjects together as
$$ Y = X\beta + \epsilon $$
where $Y \in \mathbb{R}^N$ combines all subject-specific observation vectors $Y_i$ such that we have in total $N = \sum_{i = 1}^{n}{m_i}$ observations, $X \in \mathbb{R}^{N \times p}$ combines all subject-specific design matrices and $\epsilon \in \mathbb{R}^N$ has a multivariate normal distribution
$$ \epsilon \sim N(0, \Omega) $$
where $\Omega \in \mathbb{R}^{N \times N}$ is block-diagonal containing the subject-specific $\Sigma_i$ covariance matrices on the diagonal and 0 in the remaining entries.
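
As a purely illustrative R sketch of this stacked representation, the following assembles the block-diagonal $\Omega$ from a list of subject-specific covariance matrices; the helper `block_diag` and the example blocks are assumptions made for this sketch.

```r
# Assemble a block-diagonal matrix from a list of square blocks.
block_diag <- function(blocks) {
  sizes <- vapply(blocks, nrow, integer(1))
  out <- matrix(0, sum(sizes), sum(sizes))
  offset <- 0
  for (b in blocks) {
    idx <- offset + seq_len(nrow(b))
    out[idx, idx] <- b
    offset <- offset + nrow(b)
  }
  out
}

# Example: three subjects with 2, 3, and 2 observed time points.
Sigma_list <- list(diag(2), 0.8 ^ abs(outer(1:3, 1:3, "-")), diag(2) * 2)
Omega <- block_diag(Sigma_list)
dim(Omega)  # N x N with N = 2 + 3 + 2 = 7
```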

Covariance matrix model

The symmetric and positive definite covariance matrix
$$ \Sigma = \begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \dots & \dots & \sigma_{1m} \\
\sigma_{21} & \sigma_2^2 & \sigma_{23} & \dots & \sigma_{2m} \\
\vdots & & \ddots & & \vdots \\
\vdots & & & \ddots & \vdots \\
\sigma_{m1} & \dots & \dots & \sigma_{m,m-1} & \sigma_m^2
\end{pmatrix} $$
is parametrized by a vector of variance parameters $\theta = (\theta_1, \dotsc, \theta_k)^\top$. There are many different choices for how to model the covariance matrix and correspondingly $\theta$ has different interpretations. Since any such covariance matrix has a unique Cholesky factorization $\Sigma = LL^\top$, where $L$ is the lower triangular Cholesky factor, we are going to use this below.

Unstructured covariance matrix

The most general model uses a saturated parametrization, i.e. any covariance matrix could be represented in this form. Here we use
$$ L = D\tilde{L} $$
where $D$ is the diagonal matrix of standard deviations, and $\tilde{L}$ is a unit diagonal lower triangular matrix. Hence we start $\theta$ with the natural logarithm of the standard deviations, followed by the row-wise filled entries of $\tilde{L} = \{l_{ij}\}_{1 \leq j < i \leq m}$:
$$ \theta = (\log(\sigma_1), \dotsc, \log(\sigma_m), l_{21}, l_{31}, l_{32}, \dotsc, l_{m,m-1})^\top $$
Here $\theta$ has $k = m(m+1)/2$ entries. For example, for $m = 4$ time points we need $k = 10$ variance parameters to model the unstructured covariance matrix.
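
As an illustration (this is not the package code, which implements the mapping in covariance.h), the following R sketch maps a parameter vector `theta` of length $m(m+1)/2$ to $\Sigma$ via $D$ and $\tilde{L}$; the function name `theta_to_sigma_unstructured` and the example values are made up. The row-wise fill of the strictly lower triangle is done by filling the upper triangle column-wise and transposing.

```r
# Map an unstructured variance parameter vector theta to Sigma = L L^T.
theta_to_sigma_unstructured <- function(theta, m) {
  stopifnot(length(theta) == m * (m + 1) / 2)
  D <- diag(exp(theta[seq_len(m)]), nrow = m)   # standard deviations on log scale
  U <- diag(m)
  U[upper.tri(U)] <- theta[-seq_len(m)]         # column-wise upper = row-wise lower
  L_tilde <- t(U)                               # unit diagonal lower triangular
  L <- D %*% L_tilde
  L %*% t(L)
}

# Example with m = 4 time points and k = 10 parameters.
theta <- c(log(c(1, 1.2, 1.5, 2)), 0.3, 0.1, 0.2, -0.1, 0.4, 0.25)
Sigma <- theta_to_sigma_unstructured(theta, m = 4)
all(eigen(Sigma, symmetric = TRUE)$values > 0)  # positive definite by construction
```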

Other covariance matrix choices are explained in the covariance structures vignette.

Grouped covariance matrix

In some cases, we would like to estimate unique covariance matrices across groups, while keeping the covariance structure (unstructured, ante-dependence, Toeplitz, etc.) consistent across groups. Following the notation in the previous section, for subject $i$ in group $g(i)$, we have

$$ \Sigma_{i} = S_i^\top \Sigma_{g(i)} S_i $$

where $g(i)$ is the group of subject $i$ and $\Sigma_{g(i)}$ is the covariance matrix of group $g(i)$.

The parametrization of $\theta$ is similar to the non-grouped case. Assuming that there are in total $G$ groups, the length of $\theta$ is multiplied by $G$, and each part of $\theta$ is parametrized in the same fashion. For example, for an unstructured covariance matrix, $\theta$ has $k = G \cdot m(m+1)/2$ entries.
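
For illustration only, a short R sketch of how such a grouped parameter vector could be split into per-group pieces, assuming the unstructured length $m(m+1)/2$ per group; the variable names are invented for this example.

```r
m <- 4
G <- 2
k_per_group <- m * (m + 1) / 2
theta <- rnorm(G * k_per_group)                  # grouped parameter vector

# Split theta into one chunk per group; each chunk is then mapped to
# Sigma_g exactly as in the non-grouped case (see the sketch above).
theta_by_group <- split(theta, rep(seq_len(G), each = k_per_group))
lengths(theta_by_group)  # each group has m * (m + 1) / 2 parameters
```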

Spatial covariance matrix

A spatial covariance structure can model individual-specific visit times. An individual's covariance matrix is then a function of both the population-level covariance parameters (specific to the chosen structure) and the individual's visit times. Following the notation in the previous section, for subject $i$ with in total $m_i$ visits, we have

$$ \sigma_{ijk} = \sigma \cdot f(\mathrm{dist}(\boldsymbol{c}_{ij}, \boldsymbol{c}_{ik})) $$

That is, the $(j, k)$ element of $\Sigma_{i}$ is a function of the distance between the $j$th and the $k$th visit of subject $i$, which occur at the coordinates (times) $\boldsymbol{c}_{ij}$ and $\boldsymbol{c}_{ik}$, and $\sigma$ is the constant variance. Usually the Euclidean distance is used.

Currently only the spatial exponential covariance structure is implemented. For coordinates with multiple dimensions, the Euclidean distance is used without transformations.
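
As a sketch of the idea, assuming an exponential decay $f(d) = \rho^{d}$ with constant variance as in the formula above (the exact parametrization of $\theta$ used by the package is described in the covariance structures vignette), the following R code builds $\Sigma_i$ from a subject's one-dimensional visit times; `sigma`, `rho` and `visit_times` are illustrative values.

```r
# Spatial exponential sketch: Sigma_i[j, k] = sigma * rho^d(j, k), where d is
# the Euclidean distance between the visit coordinates (here 1-dimensional times).
sigma <- 2
rho <- 0.7
visit_times <- c(0.5, 2.0, 3.5)                  # subject-specific visit times

dist_mat <- as.matrix(dist(visit_times))         # pairwise Euclidean distances
Sigma_i <- sigma * rho ^ dist_mat
Sigma_i
```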

Maximum Likelihood Estimation

Given the general linear model above, and conditional on $\theta$, we know that the likelihood for $\beta$ is
$$ L(\beta; Y) = (2\pi)^{-N/2} \det(\Omega)^{-1/2} \exp\left\{ - \frac{1}{2}(Y - X\beta)^\top \Omega^{-1} (Y - X\beta) \right\} $$
and we also know that the maximum likelihood (ML) estimate of $\beta$ is the weighted least squares estimator $\hat{\beta}$ solving the estimating equation
$$ (X^\top \Omega^{-1} X) \hat{\beta} = X^\top \Omega^{-1} Y. $$
Plugging $\hat{\beta}$ into the likelihood above then gives the value of the function we want to maximize with regard to the variance parameters $\theta$. Practically this will be done on the negative log scale:
$$ f(\theta; \hat{\beta}) = - \log L(\hat{\beta}; Y) = \frac{N}{2} \log(2\pi) + \frac{1}{2}\log\det(\Omega) + \frac{1}{2} (Y - X\hat{\beta})^\top \Omega^{-1} (Y - X\hat{\beta}) $$
The objective function $f(\theta; \hat{\beta})$ is then minimized with numerical optimizers utilizing quasi-Newton-Raphson algorithms based on the gradient (or additionally with the Hessian, see optimizer); a naive numerical sketch of this objective function follows the list below. Here the use of the Template Model Builder package TMB is helpful because

  1. TMB allows us to perform the calculations in C++, which maximizes the speed.
  2. TMB performs automatic differentiation of the objective function with regard to the variance parameters $\theta$, so that the gradient and Hessian do not have to be approximated numerically or coded explicitly.
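
Here is the naive numerical sketch promised above: a dense R computation of $f(\theta; \hat{\beta})$ directly from $\Omega$, $X$ and $Y$, useful only to fix ideas. The inputs are small random examples, and the package instead works subject by subject as described in the following sections.

```r
set.seed(123)
N <- 6
X <- cbind(1, rnorm(N))                      # stacked design matrix
Omega <- diag(N) + 0.5                       # symmetric positive definite example
Y <- rnorm(N)

W <- solve(Omega)                            # dense inverse, for illustration only
beta_hat <- solve(t(X) %*% W %*% X, t(X) %*% W %*% Y)
resid <- Y - X %*% beta_hat

f_ml <- N / 2 * log(2 * pi) +
  0.5 * as.numeric(determinant(Omega, logarithm = TRUE)$modulus) +
  0.5 * drop(t(resid) %*% W %*% resid)
f_ml
```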

Weighted least squares estimator

Let’s have a look at the details of calculating the log likelihood above, including in particular the weighted least squares (WLS) estimator $\hat{\beta}$.

The starting point is the linear equation above and the observation that both the left and right hand sides can be decomposed into subject-specific terms, given the block-diagonal structure of $\Omega$ and therefore its inverse, $W = \Omega^{-1}$:
$$ X^\top \Omega^{-1} X = X^\top W X = \sum_{i=1}^{n} X_i^\top W_i X_i $$
and similarly
$$ X^\top \Omega^{-1} Y = X^\top W Y = \sum_{i=1}^{n} X_i^\top W_i Y_i $$
where $W_i = \Sigma_i^{-1}$ is the weight matrix for subject $i$, the inverse of its covariance matrix.

Instead of calculating this inverse explicitly, it is always numerically better to work with the Cholesky factorization and solve linear equations instead. Here we calculate the factorization $\Sigma_i = L_i L_i^\top$. Note that in the case where $m_i = m$, i.e. this subject has all time points observed, then $\Sigma_i = \Sigma$ and we don't need to calculate this again because we already have $\Sigma = L L^\top$, i.e. $L_i = L$. Unfortunately, if $m_i < m$, then we need to calculate this explicitly, as there is no way to update the Cholesky factorization for a subset operation $\Sigma_i = S_i^\top \Sigma S_i$ as we have above. Given $L_i$, we solve
$$ L_i \tilde{X}_i = X_i $$
for $\tilde{X}_i$ with an efficient forward-solve, and similarly we solve
$$ L_i \tilde{Y}_i = Y_i $$
for $\tilde{Y}_i$. Therefore we have
$$ X_i^\top W_i X_i = \tilde{X}_i^\top \tilde{X}_i $$
and
$$ X_i^\top W_i Y_i = \tilde{X}_i^\top \tilde{Y}_i $$
and we can thereby calculate the left and right hand sides of the WLS estimating equation. We solve this equation with a robust Cholesky decomposition with pivoting. The advantage is that we can reuse this decomposition for calculating the covariance matrix of $\hat{\beta}$, i.e. $K = (X^\top W X)^{-1}$, by supplying the identity matrix as an alternative right hand side.
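
A compact R sketch of these steps for a single subject (illustrative only; `Sigma_i`, `X_i` and `Y_i` are made-up inputs, and the package performs the equivalent operations in C++ via TMB):

```r
set.seed(1)
m_i <- 3
Sigma_i <- 0.6 ^ abs(outer(1:m_i, 1:m_i, "-"))   # example covariance matrix
X_i <- cbind(1, rnorm(m_i))                      # example design matrix
Y_i <- rnorm(m_i)                                # example response vector

L_i <- t(chol(Sigma_i))                          # lower triangular Cholesky factor
X_tilde <- forwardsolve(L_i, X_i)                # solves L_i %*% X_tilde = X_i
Y_tilde <- forwardsolve(L_i, Y_i)

lhs <- crossprod(X_tilde)                        # equals X_i' W_i X_i
rhs <- crossprod(X_tilde, Y_tilde)               # equals X_i' W_i Y_i

# Summing lhs and rhs over subjects and solving gives the WLS estimate;
# with a single subject this reduces to:
beta_hat <- solve(lhs, rhs)

# Check against the explicit inverse (not used in practice):
W_i <- solve(Sigma_i)
all.equal(lhs, t(X_i) %*% W_i %*% X_i)
```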

Determinant and quadratic form

For the objective function we also need the log determinant of $\Omega$:
$$\begin{align}
\frac{1}{2}\log\det(\Omega)
&= \frac{1}{2}\log\det\{\text{blockdiag}(\Sigma_1, \dotsc, \Sigma_n)\} \\
&= \frac{1}{2}\log\prod_{i=1}^{n}\det{\Sigma_i} \\
&= \frac{1}{2}\sum_{i=1}^{n}\log\det(L_i L_i^\top) \\
&= \sum_{i=1}^{n}\log\det{L_i} \\
&= \sum_{i=1}^{n}\sum_{j=1}^{m_i}\log(l_{i, jj})
\end{align}$$
where $l_{i,jj}$ are the diagonal entries of the factor $L_i$ and we have used that

  • the determinant of a block diagonal matrix is the product of the determinants of the blocks,
  • the determinant of the product of matrices is the product of the determinants,
  • the determinant of the transposed matrix is the same as the original one,
  • the determinant of a triangular matrix is the product of its diagonal entries.

And finally, for the quadratic form we can reuse the weighted response vector and design matrix:
$$ (Y - X\hat{\beta})^\top \Omega^{-1} (Y - X\hat{\beta}) = \sum_{i=1}^{n} (Y_i - X_i\hat{\beta})^\top W_i (Y_i - X_i\hat{\beta}) = \sum_{i=1}^{n} (\tilde{Y}_i - \tilde{X}_i\hat{\beta})^\top (\tilde{Y}_i - \tilde{X}_i\hat{\beta}) $$
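
Continuing the single-subject sketch from above (the small setup is repeated so that the snippet runs on its own), the per-subject contributions to the log determinant and to the quadratic form can be computed from $L_i$, $\tilde{X}_i$ and $\tilde{Y}_i$ alone:

```r
set.seed(1)
m_i <- 3
Sigma_i <- 0.6 ^ abs(outer(1:m_i, 1:m_i, "-"))
X_i <- cbind(1, rnorm(m_i))
Y_i <- rnorm(m_i)

L_i <- t(chol(Sigma_i))
X_tilde <- forwardsolve(L_i, X_i)
Y_tilde <- forwardsolve(L_i, Y_i)
beta_hat <- solve(crossprod(X_tilde), crossprod(X_tilde, Y_tilde))

# Contribution of subject i to 1/2 log det(Omega): sum of log diagonal of L_i.
half_logdet_i <- sum(log(diag(L_i)))

# Contribution of subject i to the quadratic form, using the weighted residuals.
resid_tilde <- Y_tilde - X_tilde %*% beta_hat
quad_i <- drop(crossprod(resid_tilde))

# Check against the direct (dense) computations:
all.equal(half_logdet_i, 0.5 * log(det(Sigma_i)))
all.equal(
  quad_i,
  drop(t(Y_i - X_i %*% beta_hat) %*% solve(Sigma_i) %*% (Y_i - X_i %*% beta_hat))
)
```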

Restricted Maximum Likelihood Estimation

Under the restricted ML estimation (REML) paradigm we first obtain the marginal likelihood of the variance parameters $\theta$ by integrating out the remaining parameters $\beta$ from the likelihood. Here we have:
$$ L(\theta; Y) = \int_{\mathbb{R}^p} L(\beta; Y)\, d\beta = (2\pi)^{-N/2} \det(\Omega)^{-1/2} \int_{\mathbb{R}^p} \exp\left\{ - \frac{1}{2}(Y - X\beta)^\top \Omega^{-1} (Y - X\beta) \right\} d\beta $$
where we note that $\det(\Omega)$ depends on $\theta$ but not on $\beta$ and can therefore be pulled out of the integral.

Completing the square

Let’s focus now on the quadratic form in the exponential function and complete the square with regard to $\beta$ to obtain the kernel of a multivariate normal distribution:
$$\begin{align}
(Y - X\beta)^\top \Omega^{-1} (Y - X\beta)
&= Y^\top \Omega^{-1} Y + \beta^\top X^\top \Omega^{-1} X \beta - 2 \beta^\top X^\top \Omega^{-1} Y \\
&= Y^\top \Omega^{-1} Y + \beta^\top K^{-1} \beta - 2 \beta^\top K^{-1} K X^\top \Omega^{-1} Y \\
&= Y^\top \Omega^{-1} Y + \beta^\top K^{-1} \beta - 2 \beta^\top K^{-1} \hat{\beta} \\
&= Y^\top \Omega^{-1} Y + \beta^\top K^{-1} \beta - 2 \beta^\top K^{-1} \hat{\beta} + \hat{\beta}^\top K^{-1} \hat{\beta} - \hat{\beta}^\top K^{-1} \hat{\beta} \\
&= Y^\top \Omega^{-1} Y - \hat{\beta}^\top K^{-1} \hat{\beta} + (\beta - \hat{\beta})^\top K^{-1} (\beta - \hat{\beta})
\end{align}$$
where we used $K = (X^\top W X)^{-1}$ and could early on identify $K$ as the covariance matrix of the kernel of the multivariate normal of $\beta$ and then later $\hat{\beta}$ as the mean vector.
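
A quick numerical check of this identity in R, using random toy matrices purely to verify the algebra:

```r
set.seed(7)
N <- 5
p <- 2
X <- cbind(1, rnorm(N))
A <- matrix(rnorm(N * N), N, N)
Omega <- crossprod(A) + diag(N)               # random symmetric positive definite
Y <- rnorm(N)
beta <- rnorm(p)                              # arbitrary beta at which to evaluate

W <- solve(Omega)
K <- solve(t(X) %*% W %*% X)
beta_hat <- K %*% t(X) %*% W %*% Y

lhs <- drop(t(Y - X %*% beta) %*% W %*% (Y - X %*% beta))
rhs <- drop(t(Y) %*% W %*% Y) -
  drop(t(beta_hat) %*% solve(K) %*% beta_hat) +
  drop(t(beta - beta_hat) %*% solve(K) %*% (beta - beta_hat))
all.equal(lhs, rhs)
```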

With this, we know that the integral of the multivariate normal kernel is the inverse of the normalizing constant, and thus
$$ \int_{\mathbb{R}^p} \exp\left\{ - \frac{1}{2}(Y - X\beta)^\top \Omega^{-1} (Y - X\beta) \right\} d\beta = \exp\left\{ -\frac{1}{2} Y^\top \Omega^{-1} Y + \frac{1}{2} \hat{\beta}^\top K^{-1} \hat{\beta} \right\} (2\pi)^{p/2} \det(K)^{1/2} $$
such that the integrated likelihood is
$$ L(\theta; Y) = (2\pi)^{-(N-p)/2} \det(\Omega)^{-1/2} \det(K)^{1/2} \exp\left\{ -\frac{1}{2} Y^\top \Omega^{-1} Y + \frac{1}{2} \hat{\beta}^\top K^{-1} \hat{\beta} \right\}. $$

Objective function

As the objective function which we want to minimize with regard to the variance parameters $\theta$ we again take the negative natural logarithm
$$ f(\theta) = -\log L(\theta;Y) = \frac{N-p}{2} \log(2\pi) + \frac{1}{2}\log\det(\Omega) - \frac{1}{2}\log\det(K) + \frac{1}{2} \tilde{Y}^\top \tilde{Y} - \frac{1}{2} \hat{\beta}^\top \tilde{X}^\top \tilde{X} \hat{\beta} $$
It is interesting to see that the computation of the REML objective function only requires a few additional calculations compared to the ML objective function. In particular, since we already have the matrix decomposition of $K^{-1}$, it is very easy to obtain its determinant.
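
Using the same naive dense notation as before (a toy R sketch with made-up `X`, `Y` and `Omega`, not the package's subject-wise C++ computation), the REML objective can be assembled as:

```r
set.seed(99)
N <- 6
p <- 2
X <- cbind(1, rnorm(N))
Omega <- diag(N) + 0.5                       # symmetric positive definite example
Y <- rnorm(N)

L <- t(chol(Omega))                          # Omega = L %*% t(L)
X_tilde <- forwardsolve(L, X)
Y_tilde <- forwardsolve(L, Y)

K_inv <- crossprod(X_tilde)                  # X' Omega^{-1} X
beta_hat <- solve(K_inv, crossprod(X_tilde, Y_tilde))

f_reml <- (N - p) / 2 * log(2 * pi) +
  sum(log(diag(L))) +                                               # 1/2 log det(Omega)
  0.5 * as.numeric(determinant(K_inv, logarithm = TRUE)$modulus) +  # -1/2 log det(K)
  0.5 * sum(Y_tilde^2) -
  0.5 * drop(t(beta_hat) %*% K_inv %*% beta_hat)
f_reml
```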

Also here we use numerical optimization of $f(\theta)$, and the TMB library supports this efficiently through automatic differentiation.
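
To close the loop, here is a self-contained toy R sketch of the overall estimation idea: simulate complete data with $m = 2$ time points, profile out $\beta$ via WLS, and minimize the ML (not REML) negative log-likelihood over the unstructured parameters $\theta$ with stats::optim and numerical gradients. All names and settings are illustrative assumptions; the package instead evaluates $f(\theta)$ and its exact gradient in C++ through TMB and then calls the selected optimizer.

```r
set.seed(42)
n <- 200
m <- 2
X_i <- cbind(1, c(0, 1))                         # intercept + visit-2 indicator
beta_true <- c(5, -1)
L_true <- matrix(c(1, 0.4, 0, 1.2), 2, 2)        # lower triangular factor of Sigma
Y <- lapply(seq_len(n), function(i) drop(X_i %*% beta_true + L_true %*% rnorm(m)))

neg_loglik <- function(theta) {
  # theta = (log sd1, log sd2, l21), the unstructured parametrization from above
  L <- diag(exp(theta[1:2])) %*% rbind(c(1, 0), c(theta[3], 1))
  Sigma <- L %*% t(L)
  W <- chol2inv(chol(Sigma))                     # Sigma^{-1}; all subjects complete
  lhs <- n * crossprod(X_i, W %*% X_i)           # sum_i X_i' W_i X_i
  rhs <- Reduce(`+`, lapply(Y, function(y) crossprod(X_i, W %*% y)))
  beta_hat <- solve(lhs, rhs)                    # profiled (WLS) estimate of beta
  quad <- sum(vapply(Y, function(y) {
    r <- y - drop(X_i %*% beta_hat)
    drop(crossprod(r, W %*% r))
  }, numeric(1)))
  n * m / 2 * log(2 * pi) + n / 2 * log(det(Sigma)) + quad / 2
}

fit <- optim(c(0, 0, 0), neg_loglik, method = "BFGS")
exp(fit$par[1:2])                                # estimated standard deviations
```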