General formula for the asymptotic covariance of the OLS estimator

This post derives the general formula for the covariance of the ordinary least squares (OLS) estimator.

Imagine we are in the regression setup with design matrix {\bf X} \in \mathbb{R}^{n \times p} and response {\bf y} \in \mathbb{R}^n. Let {\bf x}_i \in \mathbb{R}^p and y_i \in \mathbb{R} denote the ith row of \bf X and the ith element of \bf y respectively. We can always make the following decomposition:

y_i = \mathbb{E} [y_i \mid {\bf x_i}] + \varepsilon_i,

where \mathbb{E}[\varepsilon_i \mid {\bf x}_i] = 0 and \varepsilon_i is uncorrelated with any function of {\bf x}_i. (This is Theorem 3.1.1 of Reference 1.)

The population regression function approximates y_i as {\bf x}_i^T \beta, where \beta solves the minimization problem

\beta = \text{argmin}_b \: \mathbb{E} \left[ (y_i - {\bf x}_i^T b)^2 \right].

It can be shown that

\beta = \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \mathbb{E} [{\bf x}_i y_i].

The ordinary least squares (OLS) estimator is a sample version of this and is given by

\begin{aligned} \hat\beta = ({\bf X}^T {\bf X})^{-1} {\bf X}^T {\bf y} = \left( \sum_i {\bf x}_i {\bf x}_i^T \right)^{-1} \left( \sum_i {\bf x}_i y_i \right). \end{aligned}

We are often interested in estimating the covariance matrix of \hat\beta as it is needed to construct standard errors for \hat\beta. Defining \varepsilon_i = y_i - {\bf x}_i^T \beta as the ith error term (this coincides with the decomposition above when \mathbb{E}[y_i \mid {\bf x}_i] is linear in {\bf x}_i; see the note below), we can rewrite the above as

\begin{aligned} \hat{\beta} &= \beta + \left( \sum_i {\bf x}_i {\bf x}_i^T \right)^{-1} \left( \sum_i {\bf x}_i \varepsilon_i \right), \\  \sqrt{n}(\hat{\beta} - \beta) &= n \left( \sum_i {\bf x}_i {\bf x}_i^T \right)^{-1} \cdot \dfrac{1}{\sqrt{n}} \left( \sum_i {\bf x}_i \varepsilon_i \right). \end{aligned}

By Slutsky’s Theorem, the quantity above has the same asymptotic distribution as \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \cdot \frac{1}{\sqrt{n}} \left( \sum_i {\bf x}_i \varepsilon_i \right). Since \mathbb{E}[{\bf x}_i \varepsilon_i] = {\bf 0}*, the Central Limit Theorem tells us that \frac{1}{\sqrt{n}} \left( \sum_i {\bf x}_i \varepsilon_i \right) is asymptotically normally distributed with mean zero and covariance \mathbb{E}[{\bf x}_i {\bf x}_i^T \varepsilon_i^2] (the matrix of fourth moments). Thus,

\begin{aligned} \sqrt{n}(\hat{\beta} - \beta) \stackrel{d}{\rightarrow} \mathcal{N} \left( {\bf 0}, \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \mathbb{E}[{\bf x}_i {\bf x}_i^T \varepsilon_i^2] \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \right). \end{aligned}

We can use the diagonal elements of \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \mathbb{E}[{\bf x}_i {\bf x}_i^T \varepsilon_i^2] \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} to construct standard errors of \hat\beta. The standard errors computed in this way are called heteroskedasticity-consistent standard errors (or White standard errors, or Eicker-White standard errors). They are “robust” in the sense that they use few assumptions on the data and the model: only those needed to make the Central Limit Theorem go through.

*Note: We do NOT need to assume that \mathbb{E} [y_i \mid {\bf x_i}] is linear in order to conclude that \mathbb{E}[{\bf x}_i \varepsilon_i] = {\bf 0}. All we need are the relations \beta = \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \mathbb{E} [{\bf x}_i y_i] and \varepsilon_i = y_i - {\bf x}_i^T \beta. The derivation is as follows:

\begin{aligned} \mathbb{E}[{\bf x}_i \varepsilon_i] &= \mathbb{E}[{\bf x}_i (y_i - {\bf x}_i^T \beta)] \\  &= \mathbb{E}[{\bf x}_i y_i] - \mathbb{E}[ {\bf x}_i {\bf x}_i^T] \beta \\  &= \mathbb{E}[{\bf x}_i y_i] - \mathbb{E}[ {\bf x}_i {\bf x}_i^T] \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \mathbb{E} [{\bf x}_i y_i] \\  &= {\bf 0}. \end{aligned}
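As a rough illustration (my own sketch, not from the original post), the sandwich formula above can be computed by hand for simulated data: the expectations are replaced by sample averages, and the factors of n cancel so that we estimate the covariance of \hat\beta directly. (In practice one would typically use an existing implementation such as the sandwich package.)

set.seed(1)
n <- 500
X <- cbind(1, rnorm(n))                             # design matrix with an intercept
beta <- c(2, -1)
y <- drop(X %*% beta) + rnorm(n, sd = abs(X[, 2]))  # heteroskedastic errors

fit <- lm(y ~ X - 1)
eps <- residuals(fit)

bread <- solve(crossprod(X))        # (X^T X)^{-1}
meat  <- crossprod(X * eps)         # sum_i eps_i^2 x_i x_i^T
V_hc0 <- bread %*% meat %*% bread   # sandwich estimate of Cov(beta_hat)

sqrt(diag(V_hc0))                   # HC0 heteroskedasticity-consistent standard errors
sqrt(diag(vcov(fit)))               # classical standard errors, for comparison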

Special case of homoskedastic errors

If we assume that the errors are homoskedastic, i.e.

\mathbb{E}[\varepsilon_i^2 \mid {\bf x}_i] = \sigma^2 for all i

for some constant \sigma^2 > 0, then \mathbb{E}[{\bf x}_i {\bf x}_i^T \varepsilon_i^2] = \sigma^2 \mathbb{E}[{\bf x}_i {\bf x}_i^T] and the asymptotic covariance simplifies a little:

\begin{aligned} \sqrt{n}(\hat{\beta} - \beta) \stackrel{d}{\rightarrow} \mathcal{N} \left( {\bf 0}, \sigma^2 \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \right). \end{aligned}
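Continuing the sketch above (again my own illustration), the homoskedastic plug-in estimate \hat\sigma^2 ({\bf X}^T {\bf X})^{-1} matches what vcov() reports for the lm fit; since the simulated errors above were heteroskedastic, it differs from the robust sandwich estimate.

sigma2_hat <- sum(eps^2) / (n - ncol(X))    # usual estimate of sigma^2
V_homo <- sigma2_hat * solve(crossprod(X))  # sigma^2 (X^T X)^{-1}
max(abs(V_homo - vcov(fit)))                # effectively zero
sqrt(diag(V_homo))                          # classical standard errors again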

References:

  1. Angrist, J. D., and Pischke, J.-S. (2009). Mostly harmless econometrics (Section 3.1.3).

Generating correlation matrix for AR(1) model

Assume that we are in the time series data setting, where we have data at equally-spaced times 1, 2, \dots which we denote by random variables X_1, X_2, \dots. The AR(1) model, commonly used in econometrics, assumes that the correlation between X_i and X_j is \text{Cor}(X_i, X_j) = \rho^{|i-j|}, where \rho is some parameter that usually has to be estimated.

If we were writing out the full correlation matrix for n consecutive data points X_1, \dots, X_n, it would look something like this:

\begin{pmatrix} 1 & \rho & \rho^2 & \dots & \rho^{n-1} \\ \rho & 1 & \rho & \dots & \rho^{n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} &\dots & 1 \end{pmatrix}

(Side note: This is an example of a correlation matrix which has Toeplitz structure.)

Given \rho, how can we generate this matrix quickly in R? The function below is my (current) best attempt:

ar1_cor <- function(n, rho) {
    exponent <- abs(matrix(1:n - 1, nrow = n, ncol = n, byrow = TRUE) - 
                        (1:n - 1))
    rho^exponent
}

In the function above, n is the number of rows in the desired correlation matrix (which is the same as the number of columns), and rho is the \rho parameter. The function makes use of the fact that when subtracting a vector from a matrix, R automatically recycles the vector to have the same number of elements as the matrix, and it does so in a column-wise fashion.

Here is an example of how the function can be used:

ar1_cor(4, 0.9)
#       [,1] [,2] [,3]  [,4]
# [1,] 1.000 0.90 0.81 0.729
# [2,] 0.900 1.00 0.90 0.810
# [3,] 0.810 0.90 1.00 0.900
# [4,] 0.729 0.81 0.90 1.000

Such a function might be useful when trying to generate data that has such a correlation structure. For example, it could be passed as the Sigma parameter for MASS::mvrnorm(), which generates samples from a multivariate normal distribution.
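For instance (a small sketch of my own), we can draw many samples with MASS::mvrnorm() and check that their sample correlation is close to the target matrix:

library(MASS)

set.seed(42)
Sigma <- ar1_cor(4, 0.9)
samples <- mvrnorm(n = 10000, mu = rep(0, 4), Sigma = Sigma)
round(cor(samples), 2)   # should be close to ar1_cor(4, 0.9)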

Can you think of other ways to generate this matrix?
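One alternative that comes to mind (my own sketch, not part of the original function) is to use stats::toeplitz(), exploiting the Toeplitz structure noted above; a direct call to outer() works too:

ar1_cor_toeplitz <- function(n, rho) toeplitz(rho^(0:(n - 1)))
ar1_cor_outer <- function(n, rho) rho^abs(outer(1:n, 1:n, "-"))

all.equal(ar1_cor_toeplitz(4, 0.9), ar1_cor(4, 0.9))  # TRUE
all.equal(ar1_cor_outer(4, 0.9), ar1_cor(4, 0.9))     # TRUE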

Properties of covariance matrices

This post lists some properties of covariance matrices that I often forget.

A covariance matrix \Sigma \in \mathbb{R}^{n \times n} is simply a matrix for which there exists some random vector X \in \mathbb{R}^n with \Sigma_{ij} = \text{Cov}(X_i, X_j) for all i and j.

Properties:

  1. \Sigma is symmetric since \text{Cov}(X_i, X_j) = \text{Cov}(X_j, X_i).
  2. \Sigma is positive semi-definite (PSD):
    \begin{aligned} u^T \Sigma u &= \sum_{i, j=1}^n u_i \Sigma_{ij}u_j \\  &= \sum_{i, j=1}^n \text{Cov}(u_i X_i, u_j X_j) \\ &= \text{Cov}\left( \sum_i u_i X_i, \sum_j u_j X_j \right) \geq 0. \end{aligned}
  3. Because \Sigma is PSD, all of its eigenvalues are non-negative. (If \Sigma were positive definite, then its eigenvalues would be positive.)
  4. Since \Sigma is real and symmetric, all of its eigenvalues are real, and there exists a real orthogonal matrix Q such that D = Q^T \Sigma Q is a diagonal matrix. (The entries along the diagonal of D are \Sigma's eigenvalues.)
  5. Since \Sigma's eigenvalues are all non-negative, \Sigma has a square root: \Sigma^{1/2} = Q D^{1/2} Q^T. (If \Sigma is positive definite, then it also has an inverse square root: \Sigma^{-1/2} = Q D^{-1/2} Q^T.)

Note that the above apply to sample covariance matrices as well.
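As a quick numerical illustration of these properties (my own sketch, using base R):

set.seed(1)
X <- matrix(rnorm(50 * 3), nrow = 50)
S <- cov(X)                 # 3 x 3 sample covariance matrix

isSymmetric(S)              # TRUE (property 1)
eig <- eigen(S)
eig$values                  # all non-negative (properties 2 and 3)

# symmetric square root from the eigendecomposition (properties 4 and 5)
S_half <- eig$vectors %*% diag(sqrt(eig$values)) %*% t(eig$vectors)
max(abs(S_half %*% S_half - S))   # effectively zero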

References:

  1. Wikipedia. Covariance matrix.
  2. Statistical Odds & Ends. Properties of real symmetric matrices.

What do we mean by isotropic/anisotropic covariance?

Update (2019-10-24): I totally messed up the definitions the first time I posted this! I’ve fixed it now. Many thanks to commenter szcfweiya for pointing this out!

Let \{ X_t \}_{t \in \mathbb{I}} be a stochastic process, where \mathbb{I} is the index set for the stochastic process. (Most often we have \mathbb{I} = [0, \infty) to index time or \mathbb{I} = \mathbb{R}^d to index space). The stochastic process has an associated covariance function K: \mathbb{I} \times \mathbb{I} \mapsto \mathbb{R} such that for any s, t \in \mathbb{I}, \text{Cov}(X_s, X_t) = K(s, t).

In general, a covariance function must satisfy two properties:

  1. It is symmetric, i.e. K(x, x') = K(x', x) for all x, x' \in \mathbb{I}, and
  2. It is positive semi-definite, i.e. for all n \in \mathbb{N}, x_1, x_2, \dots, x_n \in \mathbb{I}, a_1, \dots, a_n \in \mathbb{R}, \begin{aligned} \sum_{i=1}^n \sum_{j=1}^n a_i K(x_i, x_j) a_j \geq 0 \end{aligned}.

A covariance is isotropic if K(x, x') depends only on the distance \| x - x' \|. A covariance is said to be anisotropic if it is not isotropic. That is, K(x, x') is not a function of \| x - x' \| alone: either it does not depend on \| x - x' \| at all, or it depends on \| x - x' \| together with some other function of x and x'.

Clearly the class of isotropic covariances is much smaller than that of anisotropic covariances. To model anisotropic covariances, one usually has to make an assumption on how K(x, x') depends on x and x'.

In my little googling around, anisotropic covariance modeling seems to be popular in geostatistics and more broadly, spatial statistics. One popular example of anisotropic covariance is called geometric anisotropy. In that setting, the index set of the stochastic process is \mathbb{R}^n (typically with n = 1 or n = 2). Isotropic covariance in this setting would have the form K(x, x') = \rho (d (x, x')), where \rho is some function and d is the Euclidean metric. (See this earlier post for some examples. Not all kernels there are isotropic but it should be obvious which are.) Geometric anisotropy refers to a covariance of the form K(x, x') = \rho (d' (x, x')), where d' is some other distance metric. Some examples (from Reference 1) are the Mahalanobis distance and the Minkowski distance.

A special case of using the Mahalanobis distance is where d'(x, x') = \sqrt{(x-x')^T D (x-x')} for some diagonal matrix D with positive entries. This corresponds to giving each axis a different scale before computing the Euclidean distance.
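As a rough illustration of this axis scaling (my own sketch, with a squared-exponential form for \rho and made-up scale values):

# isotropic covariance: depends only on the Euclidean distance
iso_cov <- function(x, y) exp(-0.5 * sum((x - y)^2))

# geometric anisotropy: rescale each axis before measuring distance,
# i.e. use d'(x, y)^2 = (x - y)^T D (x - y) with D = diag(1 / scales^2)
aniso_cov <- function(x, y, scales = c(1, 5)) {
    exp(-0.5 * sum(((x - y) / scales)^2))
}

x0 <- c(0, 0)
iso_cov(x0, c(1, 0));   iso_cov(x0, c(0, 1))    # equal: only distance matters
aniso_cov(x0, c(1, 0)); aniso_cov(x0, c(0, 1))  # different: direction matters too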

References:

  1. Haskard, K. A. (2007) An anisotropic Matérn spatial covariance model: REML estimation and properties.

Sampling paths from a Gaussian process

Gaussian processes are a widely employed statistical tool because of their flexibility and computational tractability. (For instance, one recent area where Gaussian processes are used is in machine learning for hyperparameter optimization.)

A stochastic process \{ X_t \}_{t \in \mathbb{I}} is a Gaussian process if (and only if) any finite subcollection of random variables (X_{t_1}, \dots, X_{t_n}) has a multivariate Gaussian distribution. Here, \mathbb{I} is the index set for the Gaussian process; most often we have \mathbb{I} = [0, \infty) (to index time) or \mathbb{I} = \mathbb{R}^d (to index space).

The stochastic nature of Gaussian processes also allows them to be thought of as distributions over functions. One draw from a Gaussian process corresponds to choosing a function f: \mathbb{I} \mapsto \mathbb{R} according to some probability distribution over such functions.

Gaussian processes are defined by their mean and covariance functions. The covariance (or kernel) function K: \mathbb{I} \times \mathbb{I} \mapsto \mathbb{R} is what characterizes the shapes of the functions which are drawn from the Gaussian process. In this post, we will demonstrate how the choice of covariance function affects the shape of functions it produces. For simplicity, we will assume \mathbb{I} = \mathbb{R}.

(Click on this link to see all code for this post in one script. For more technical details on the covariance functions, see this previous post.)

Overall set-up

Let’s say we have a zero-centered Gaussian process GP(0, K(\cdot, \cdot)), and that f is a function drawn from this Gaussian process. For a vector (x_1, \dots, x_n), the function values (f(x_1), \dots, f(x_n)) must have a multivariate Gaussian distribution with mean {\bf 0} and covariance matrix \Sigma with entries \Sigma_{ij} = K(x_i, x_j). We make use of this property to draw such a function: we select a fine grid of x-coordinates, use mvrnorm() from the MASS package to draw the function values at these points, then connect them with straight lines.

Assume that we have already written an R function kernel_fn for the kernel. The first function below generates a covariance matrix from this kernel, while the second takes N draws from this kernel (using the first function as a subroutine):

library(MASS)

# generate covariance matrix for points in `x` using given kernel function
cov_matrix <- function(x, kernel_fn, ...) {
    outer(x, x, function(a, b) kernel_fn(a, b, ...))
}

# given x coordinates, take N draws from kernel function at those points
draw_samples <- function(x, N, seed = 1, kernel_fn, ...) {
    Y <- matrix(NA, nrow = length(x), ncol = N)
    set.seed(seed)
    K <- cov_matrix(x, kernel_fn, ...)  # same for every draw, so compute it just once
    for (n in 1:N) {
        Y[, n] <- mvrnorm(1, mu = rep(0, times = length(x)), Sigma = K)
    }
    Y
}

The ... argument for the draw_samples() function allows us to pass arguments into the kernel function kernel_fn.

We will use the following parameters for the rest of the post:

x <- seq(0, 2, length.out = 201)  # x-coordinates
N <- 3  # no. of draws
col_list <- c("red", "blue", "black")  # for line colors

Squared exponential (SE) kernel

The squared exponential (SE) kernel, also known as the radial basis function (RBF) kernel or the Gaussian kernel, has the form

\begin{aligned} K(x, x') = \sigma^2 \exp \left[ -\frac{\| x - x' \|^2}{2l^2} \right], \end{aligned}

where \sigma^2 \geq 0 and l > 0 are hyperparameters. (All kernels have a \sigma^2 parameter which determines how variable the function is overall; for simplicity we will assume it to be equal to 1 for the rest of this post.) It is possibly the most commonly used kernel because of its computational tractability. The following code generates 3 draws from the SE kernel with l = 0.2:

se_kernel <- function(x, y, sigma = 1, length = 1) {
    sigma^2 * exp(- (x - y)^2 / (2 * length^2))
}

Y <- draw_samples(x, N, kernel_fn = se_kernel, length = 0.2)
plot(range(x), range(Y), xlab = "x", ylab = "y", type = "n",
     main = "SE kernel, length = 0.2")
for (n in 1:N) {
    lines(x, Y[, n], col = col_list[n], lwd = 1.5)
}

The following code shows how changing the “length-scale” parameter l affects the functions drawn: the smaller l is, the more wiggly the drawn functions are.

par(mfrow = c(1, 3))
for (l in c(0.2, 0.7, 1.5)) {
    Y <- draw_samples(x, N, kernel_fn = se_kernel, length = l)
    
    plot(range(x), range(Y), xlab = "x", ylab = "y", type = "n",
         main = paste("SE kernel, length =", l))
    for (n in 1:N) {
        lines(x, Y[, n], col = col_list[n], lwd = 1.5)
    }
}

Rational quadratic (RQ) kernel

The rational quadratic (RQ) kernel has the form

\begin{aligned} K(x, x') = \sigma^2 \left( 1 + \frac{\| x - x' \|^2}{2 \alpha l^2}\right)^{-\alpha}, \end{aligned}

where \sigma \geq 0, l > 0 and \alpha > 0 are hyperparameters. Below we create a function for the kernel and demonstrate how \alpha affects the functions drawn:

rq_kernel <- function(x, y, alpha = 1, sigma = 1, length = 1) {
    sigma^2 * (1 + (x - y)^2 / (2 * alpha * length^2))^(-alpha)
}

par(mfrow = c(1, 3))
for (a in c(0.01, 0.5, 50)) {
    Y <- draw_samples(x, N, kernel_fn = rq_kernel, alpha = a)
    
    plot(range(x), range(Y), xlab = "x", ylab = "y", type = "n",
         main = paste("RQ kernel, alpha =", a))
    for (n in 1:N) {
        lines(x, Y[, n], col = col_list[n], lwd = 1.5)
    }
}

Matérn covariance functions

The Matérn covariance function has the form

\begin{aligned} K(x, x') = \sigma^2 \frac{2^{1-\nu}}{\Gamma (\nu)} \left( \frac{\sqrt{2\nu} \|x-x'\|}{l}\right)^\nu K_\nu \left( \frac{\sqrt{2\nu} \|x-x'\|}{l} \right), \end{aligned}

where \Gamma is the gamma function and K_\nu is the modified Bessel function of the second kind. The hyperparameters are \sigma \geq 0, l > 0 and \nu > 0. For \nu = p + 1/2 for some non-negative integer p, the expression on the RHS becomes a little nicer. In practice the values \nu = 1/2, 3/2 and 5/2 are used much more often than anything else, so those are the only ones we code up here.

matern_kernel <- function(x, y, nu = 1.5, sigma = 1, l = 1) {
    if (!(nu %in% c(0.5, 1.5, 2.5))) {
        stop("p must be equal to 0.5, 1.5 or 2.5")
    }
    p <- nu - 0.5
    d <- abs(x - y)
    if (p == 0) {
        sigma^2 * exp(- d / l)
    } else if (p == 1) {
        sigma^2 * (1 + sqrt(3)*d/l) * exp(- sqrt(3)*d/l)
    } else {
        sigma^2 * (1 + sqrt(5)*d/l + 5*d^2 / (3*l^2)) * exp(-sqrt(5)*d/l)
    }
}

par(mfrow = c(1, 3))
for (nu in c(0.5, 1.5, 2.5)) {
    Y <- draw_samples(x, N, kernel_fn = matern_kernel, nu = nu)
    
    plot(range(x), range(Y), xlab = "x", ylab = "y", type = "n",
         main = paste("Matern kernel, nu =", nu * 2, "/ 2"))
    for (n in 1:N) {
        lines(x, Y[, n], col = col_list[n], lwd = 1.5)
    }
}

The paths from the Matérn-1/2 kernel are often deemed too rough to be used in practice.

Periodic kernel

The periodic kernel has the form

\begin{aligned} K(x, x') = \sigma^2 \exp \left[ - \frac{2 \sin^2 (\pi \| x - x'\| / p) }{l^2} \right], \end{aligned}

where \sigma \geq 0, l > 0 and p > 0 are hyperparameters. It is good for modeling functions which repeat themselves exactly. p is the period of the function, as can be seen from the paths below:

period_kernel <- function(x, y, p = 1, sigma = 1, length = 1) {
    sigma^2 * exp(-2 * sin(pi * abs(x - y) / p)^2 / length^2)
}

par(mfrow = c(1, 3))
for (p in c(0.5, 1, 2)) {
    Y <- draw_samples(x, N, kernel_fn = period_kernel, p = p)
    
    plot(range(x), range(Y), xlab = "x", ylab = "y", type = "n",
         main = paste("Periodic kernel, p =", p))
    for (n in 1:N) {
        lines(x, Y[, n], col = col_list[n], lwd = 1.5)
    }
}

Linear/polynomial kernel

The polynomial kernel has the form

\begin{aligned} K(x, x') = (x^T x' + \sigma^2)^d, \end{aligned}

where d \in \mathbb{N} is the degree of the polynomial and \sigma \geq 0 is a hyperparameter. The first plot shows how the value of \sigma can affect the linear kernel, while the second plot shows how the degree affects the shape of the polynomial kernel.

poly_kernel <- function(x, y, sigma = 1, d = 1) {
    (sigma^2 + x * y)^d
}

# linear kernel w different sigma
par(mfrow = c(1, 3))
for (s in c(0.5, 1, 5)) {
    Y <- draw_samples(x, N, kernel_fn = poly_kernel, sigma = s)
    
    plot(range(x), range(Y), xlab = "x", ylab = "y", type = "n",
         main = paste("Linear kernel, sigma =", s))
    for (n in 1:N) {
        lines(x, Y[, n], col = col_list[n], lwd = 1.5)
    }
}

# poly kernel of different degrees
par(mfrow = c(1, 3))
for (d in c(1, 2, 3)) {
    Y <- draw_samples(x, N, kernel_fn = poly_kernel, d = d)
    
    plot(range(x), range(Y), xlab = "x", ylab = "y", type = "n",
         main = paste("Polynomial kernel, d =", d))
    for (n in 1:N) {
        lines(x, Y[, n], col = col_list[n], lwd = 1.5)
    }
}

Brownian motion

Brownian motion, the most studied object in stochastic processes, is a one-dimensional Gaussian process with mean zero and covariance function K(x, x') = \min (x, x'). Its paths are extremely rough.

bm_kernel <- function(x, y) {
    pmin(x, y)
}

Y <- draw_samples(x, N, kernel_fn = bm_kernel)

plot(range(x), range(Y), xlab = "x", ylab = "y", type = "n",
     main = "Brownian motion kernel")
for (n in 1:N) {
    lines(x, Y[, n], col = col_list[n], lwd = 1.5)
}

Note that I had to use the pmin() function instead of min() in the Brownian motion covariance function. This is because the outer(X, Y, FUN, ...) function we called in cov_matrix()

“…[extends X and Y] by rep to length the products of the lengths of X and Y before FUN is called.” (from outer() documentation)

pmin() performs the elementwise operation we want, while min() would simply return the single smallest value across both of its arguments, which is not what we want.

Common covariance classes for Gaussian processes

A stochastic process \{ X_t \}_{t \in \mathbb{I}} is a Gaussian process if (and only if) any finite subcollection of random variables (X_{t_1}, \dots, X_{t_n}) has a multivariate Gaussian distribution. Here, \mathbb{I} is the index set for the stochastic process; most often we have \mathbb{I} = [0, \infty) (to index time) or \mathbb{I} = \mathbb{R}^d (to index space).

To define a Gaussian process, one needs (i) a mean function \mu : \mathbb{I} \mapsto \mathbb{R}, and (ii) a covariance function K: \mathbb{I} \times \mathbb{I} \mapsto \mathbb{R}. While there are no restrictions on the mean function, the covariance function must be:

  1. Symmetric, i.e. K(x, x') = K(x', x) for all x, x' \in \mathbb{I}, and
  2. Positive semi-definite, i.e. for all n \in \mathbb{N}, x_1, x_2, \dots, x_n \in \mathbb{I}, a_1, \dots, a_n \in \mathbb{R}, \begin{aligned} \sum_{i=1}^n \sum_{j=1}^n a_i K(x_i, x_j) a_j \geq 0 \end{aligned}.

Covariance functions are sometimes called kernels. Here are some commonly used covariance functions (unless otherwise stated, these kernels are applicable for \mathbb{I} = \mathbb{R}^d):

Squared exponential (SE) kernel

  • Also known as the radial basis function (RBF) kernel, or the Gaussian kernel.
  • Has the form \begin{aligned} K(x, x') = \sigma^2 \exp \left[ -\frac{\| x - x' \|^2}{2l^2} \right] \end{aligned}, where \sigma^2 \geq 0 and l > 0 are hyperparameters.
    • \sigma^2 is an overall scale factor that every kernel has, determining the overall variance.
    • l is a “length-scale” hyperparameter that determines how “wiggly” the function is: larger l means that it is less wiggly.
  • The functions drawn from this process are infinitely differentiable (i.e. very smooth). This strong smoothness assumption is probably unrealistic in practice. Nevertheless, the SE kernel remains one of the most popular kernels.
  • This kernel is stationary (i.e. value of K(x, x') depends only on x - x') and isotropic (i.e. value of K(x, x') depends only on \|x - x' \|).
  • It is possible for each dimension to have its own length-scale hyperparameter: we would replace the exponent with \begin{aligned} -\frac{1}{2}\sum_{i=1}^d \left(\frac{x_i - x_i'}{l_i}\right)^2 \end{aligned}. (This generalization can be done for any stationary kernel. Note that the resulting kernel will no longer be isotropic.)
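For example (a minimal sketch of this per-dimension generalization, with a made-up function name), in R this could look like:

# SE kernel with one length-scale per input dimension (x and y are vectors in R^d)
se_kernel_ard <- function(x, y, sigma = 1, lengths = rep(1, length(x))) {
    sigma^2 * exp(-0.5 * sum(((x - y) / lengths)^2))
}

se_kernel_ard(c(0, 0), c(0, 1), lengths = c(1, 1))    # standard SE kernel value
se_kernel_ard(c(0, 0), c(0, 1), lengths = c(1, 0.2))  # decays faster along the second axis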

Rational quadratic (RQ) kernel

  • Has the form \begin{aligned} K(x, x') = \sigma^2 \left( 1 + \frac{\| x - x' \|^2}{2 \alpha l^2}\right)^{-\alpha} \end{aligned}, where \sigma \geq 0, l > 0 and \alpha > 0 are hyperparameters.
    • As in the SE kernel, l is a length-scale parameter.
    • The rational quadratic kernel can be viewed as a scale mixture of SE kernels with different length scales. Larger values of \alpha give more weight to the SE kernels with longer length scales. As \alpha \rightarrow \infty, the RQ kernel becomes the SE kernel.

Matérn covariance functions

  • Named after Swedish statistician Bertil Matérn.
  • Has the form \begin{aligned} K(x, x') = \sigma^2 \frac{2^{1-\nu}}{\Gamma (\nu)} \left( \frac{\sqrt{2\nu} \|x-x'\|}{l}\right)^\nu K_\nu \left( \frac{\sqrt{2\nu} \|x-x'\|}{l} \right) \end{aligned}, where \Gamma is the gamma function and K_\nu is the modified Bessel function of the second kind.
  • The hyperparameters are \sigma \geq 0, l > 0 and \nu > 0.
  • The functions drawn from this process are \lceil \nu \rceil - 1 times differentiable.
  • The larger \nu is, the smoother the functions drawn from this process. As \nu \rightarrow \infty, this kernel converges to the SE kernel.
  • When \nu = p + 1/2 for some integer p, the kernel can be written as a product of an exponential and a polynomial of order p. For this reason, the values \nu = 1/2, \nu = 3/2 and \nu = 5/2 are commonly used. The latter two are more popular as the samples from \nu = 1/2 are often thought to be too “rough”. (Rasmussen & Williams make the case that it is hard to distinguish between values of \nu \geq 7/2 and \nu = \infty.)
  • This kernel is stationary and isotropic.
  • When \nu = 1/2, the resulting kernel is known as the exponential covariance function. If we further restrict \mathbb{I} = \mathbb{R}, the corresponding Gaussian process is the Ornstein-Uhlenbeck process.

Periodic kernel

  • Has the form \begin{aligned} K(x, x') = \sigma^2 \exp \left[ - \frac{2 \sin^2 (\pi \| x - x'\| / p) }{l^2} \right] \end{aligned}, where \sigma \geq 0, l > 0 and p > 0 are hyperparameters.
    • p is the period of the function, determining the distance between repetitions of the function.
  • Good for modeling functions which repeat themselves exactly.
  • Sometimes functions repeat themselves almost exactly, not exactly. In this situation, we can use the product of the periodic kernel with another kernel (the product of two kernels is itself a kernel). Such kernels are known as locally periodic kernels.
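As a rough sketch of this product construction (my own illustration, reusing the period_kernel, se_kernel and draw_samples functions from the sampling post above; local_period_kernel is a made-up name):

# locally periodic kernel: periodic kernel multiplied by an SE kernel
local_period_kernel <- function(x, y, p = 1, sigma = 1, length = 1) {
    period_kernel(x, y, p = p, sigma = sigma, length = length) *
        se_kernel(x, y, sigma = 1, length = length)
}

Y <- draw_samples(x, N, kernel_fn = local_period_kernel, p = 0.5)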

Linear/polynomial kernel

  • The linear kernel has the form \begin{aligned} K(x, x') = x^T x' + \sigma^2 \end{aligned}, where \sigma \geq 0 is a hyperparameter.
  • The polynomial kernel generalizes the linear kernel: \begin{aligned} K(x, x') = (x^T x' + \sigma^2)^d \end{aligned}, where d \in \mathbb{N} is the degree of the polynomial, usually taken to be 2.
  • It is a nonstationary kernel.

Brownian motion

  • Brownian motion is a one-dimensional Gaussian process with mean zero and covariance function K(x, x') = \min (x, x').

See this webpage for a longer list of kernels.

References:

  1. Duvenaud, D. The Kernel Cookbook: Advice on Covariance functions.
  2. Rasmussen, C. E., and Williams, C. K. I. (2006). Gaussian processes for machine learning. Chapter 4: Covariance Functions.
  3. Snelson, E. (2006). Tutorial: Gaussian process models for machine learning.
  4. Wikipedia. Matérn covariance function.

Eigenvalues of a special correlation matrix

Consider the covariance matrix \Sigma \in \mathbb{R}^{d \times d} such that

\Sigma_{ij} = \begin{cases} 1 &\text{if } i = j, \\ \rho &\text{if } i \neq j. \end{cases}

This represents a highly idealized situation where all \binom{d}{2} pairs of features have the same correlation. It turns out that computing the eigenvalues for \Sigma is really easy!

Recall that \lambda is an eigenvalue for a matrix \Sigma iff \det (\Sigma - \lambda I) = 0. Setting \lambda = 1 - \rho makes \Sigma - \lambda I a square matrix with every entry equal to \rho, which clearly has determinant zero. Notice also that this all-\rho matrix has rank 1 (assuming \rho \neq 0), so the set of eigenvectors associated with eigenvalue 1-\rho, \{ x: (\Sigma - (1-\rho)I)x = 0 \}, has dimension d - 1. This means that the eigenvalue 1 - \rho has multiplicity d - 1.

There is one remaining eigenvalue to find. Since the rows of \Sigma all have the same sum, the vector (1, \dots, 1)^T is going to be an eigenvector:

\Sigma \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} = \begin{pmatrix} 1 + (d-1)\rho \\ 1 + (d-1)\rho \\ \vdots \\ 1 + (d-1)\rho \end{pmatrix}.

Hence, the other eigenvalue of \Sigma is 1 + (d-1)\rho.

The solution can be verified for specific instances with the R code below:

rho <- 0.8
d <- 7
# d x d matrix with 1s on the diagonal and rho everywhere else
S <- rho * matrix(1, nrow = d, ncol = d) + 
    diag(rep(1-rho, d))
svd(S)$d          # eigenvalues (S is symmetric PSD, so its singular values are its eigenvalues)
1 - rho           # should appear d - 1 times above
1 + (d-1)*rho     # should appear once above

Toeplitz covariance structure

When someone says that their model has Toeplitz covariance (or correlation) structure, what do they mean?

In linear algebra, a Toeplitz matrix is also known as a diagonal-constant matrix: i.e. “each descending diagonal from left to right is constant”:

A = \begin{bmatrix} a_0 & a_{-1} & \dots & \dots & a_{-(n-1)} \\ a_1 & a_0 & a_{-1} &\ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ \vdots & \ddots & a_1 & a_0 & a_{-1} \\ a_{n-1} & \dots & \dots & a_1 & a_0  \end{bmatrix}.

A model is said to have Toeplitz covariance (correlation resp.) structure if the covariance (correlation resp.) matrix is a Toeplitz matrix. Here are two places where we see such structures pop up:

  • We have time series data at equally-spaced times 1, 2, \dots, denoted by X_1, X_2, \dots. This model has Toeplitz covariance (correlation resp.) structure if for any n, the covariance (correlation resp.) matrix of X_1, X_2, \dots, X_n is Toeplitz. The AR(1) model, commonly used in econometrics, is an example of this, since it has \text{Cor}(X_i, X_j) = \rho^{|i-j|} for some \rho.
  • We have a continuous-time stochastic process \{ X_t \} which is weakly stationary, i.e. for any 2 times t and s, \text{Cov}(X_t, X_s) depends only on t - s. In this setting, for any equally-spaced times t_1, \dots, t_n, the covariance matrix of X_{t_1}, \dots, X_{t_n} will be Toeplitz.
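Following on the two bullets above, here is a small sketch (my own) of how stats::toeplitz() builds such a covariance matrix directly from the covariances at lags 0, 1, \dots, n-1:

rho <- 0.7
acov <- rho^(0:4)   # covariances at lags 0, ..., 4 (an AR(1)-type example)
toeplitz(acov)      # Toeplitz covariance matrix for X_1, ..., X_5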

Why work with Toeplitz covariance structure? Other than the fact that such structures arise naturally in certain situations (like the two above), operations with Toeplitz matrices are fast: the structure can be exploited to solve linear systems and compute inverses (when they exist) much more quickly than for general matrices.

References:

  1. Toeplitz matrix, Wikipedia.
  2. Guidelines for Selecting the Covariance Structure in Mixed Model Analysis, Chuck Kincaid.
  3. Toeplitz Covariance Matrix Estimation for Adaptive Beamforming and Ultrasound Imaging, Michael Pettersson.