# Common covariance classes for Gaussian processes

A stochastic process $\{ X_t \}_{t \in \mathbb{I}}$ is a Gaussian process if (and only if) any finite subcollection of random variables $(X_{t_1}, \dots, X_{t_n})$ has a multivariate Gaussian distribution. Here, $\mathbb{I}$ is the index set for the stochastic process; most often we have $\mathbb{I} = [0, \infty)$ (to index time) or $\mathbb{I} = \mathbb{R}^d$ (to index space).

To define a Gaussian process, one needs (i) a mean function $\mu : \mathbb{I} \to \mathbb{R}$, and (ii) a covariance function $K: \mathbb{I} \times \mathbb{I} \to \mathbb{R}$. While there are no restrictions on the mean function, the covariance function must be:

1. Symmetric, i.e. $K(x, x') = K(x', x)$ for all $x, x' \in \mathbb{I}$, and
2. Positive semi-definite, i.e. for all $n \in \mathbb{N}$, $x_1, x_2, \dots, x_n \in \mathbb{I}$, $a_1, \dots, a_n \in \mathbb{R}$, \begin{aligned} \sum_{i=1}^n \sum_{j=1}^n a_i K(x_i, x_j) a_j \geq 0 \end{aligned}.
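Both requirements can be checked numerically on any finite Gram matrix: symmetry directly, and positive semi-definiteness via the eigenvalues. A minimal NumPy sketch (the function name is my own):

```python
import numpy as np

def is_valid_kernel_matrix(K, tol=1e-10):
    """Check the two requirements on a finite Gram matrix:
    symmetry, and positive semi-definiteness (via eigenvalues)."""
    if not np.allclose(K, K.T):
        return False
    # A symmetric matrix is PSD iff all its eigenvalues are >= 0 (up to tolerance).
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

# Gram matrix of K(x, x') = exp(-(x - x')^2 / 2) at a few points: valid.
x = np.linspace(0.0, 4.0, 5)
G = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)

# A symmetric matrix with a negative eigenvalue: not a valid Gram matrix.
B = np.array([[1.0, 2.0], [2.0, 1.0]])
```

Note that this only certifies the matrix at the chosen points; positive semi-definiteness of the kernel itself must hold for every finite set of points.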

Covariance functions are sometimes called kernels. Here are some commonly used covariance functions (unless otherwise stated, these kernels are applicable for $\mathbb{I} = \mathbb{R}^d$):

## Squared exponential (SE) kernel

• Also known as the radial basis function (RBF) kernel, or the Gaussian kernel.
• Has the form \begin{aligned} K(x, x') = \sigma^2 \exp \left[ -\frac{\| x - x' \|^2}{2l^2} \right] \end{aligned}, where $\sigma^2 \geq 0$ and $l > 0$ are hyperparameters.
• $\sigma^2$ is a scale hyperparameter, common to virtually every kernel, that determines the overall variance of the function values.
• $l$ is a “length-scale” hyperparameter that determines how “wiggly” the function is: larger $l$ means that it is less wiggly.
• The functions drawn from this process are infinitely differentiable (i.e. very smooth). This strong smoothness assumption is probably unrealistic in practice. Nevertheless, the SE kernel remains one of the most popular kernels.
• This kernel is stationary (i.e. the value of $K(x, x')$ depends only on $x - x'$) and isotropic (i.e. the value of $K(x, x')$ depends only on $\|x - x' \|$).
• It is possible for each dimension to have its own length-scale hyperparameter: we would replace the exponent with \begin{aligned} -\frac{1}{2}\sum_{i=1}^d \left(\frac{x_i - x_i'}{l_i}\right)^2 \end{aligned}. (This generalization can be done for any stationary kernel. Note that the resulting kernel will no longer be isotropic.)
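A minimal NumPy sketch of the SE kernel, supporting a per-dimension length scale as described above, followed by drawing sample functions from the corresponding zero-mean GP prior (function and variable names are my own):

```python
import numpy as np

def se_kernel(X, Xp, sigma2=1.0, lengthscales=1.0):
    """Squared exponential kernel between rows of X (n, d) and Xp (m, d).
    lengthscales may be a scalar or a length-d array (one per dimension)."""
    X, Xp = np.atleast_2d(X), np.atleast_2d(Xp)
    diff = (X[:, None, :] - Xp[None, :, :]) / np.asarray(lengthscales)
    return sigma2 * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

# Draw three sample functions from a zero-mean GP prior with this kernel.
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 100)[:, None]
K = se_kernel(x, x, sigma2=1.0, lengthscales=0.5)
# A small jitter on the diagonal keeps the sampling step numerically stable.
f = rng.multivariate_normal(np.zeros(100), K + 1e-8 * np.eye(100), size=3)
```

Re-running with a larger length scale (e.g. `lengthscales=2.0`) produces visibly less wiggly sample paths.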

## Rational quadratic (RQ) kernel

• Has the form \begin{aligned} K(x, x') = \sigma^2 \left( 1 + \frac{\| x - x' \|^2}{2 \alpha l^2}\right)^{-\alpha} \end{aligned}, where $\sigma \geq 0$, $l > 0$ and $\alpha > 0$ are hyperparameters.
• As in the SE kernel, $l$ is a length-scale parameter.
• The rational quadratic kernel can be viewed as a scale mixture of SE kernels with different length scales. Larger values of $\alpha$ give more weight to the SE kernels with longer length scales. As $\alpha \rightarrow \infty$, the RQ kernel becomes the SE kernel.
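The limiting behavior is easy to verify numerically: the pointwise gap between the RQ and SE kernels shrinks as $\alpha$ grows. A small sketch (function names are my own):

```python
import numpy as np

def rq_kernel(r, sigma2=1.0, l=1.0, alpha=1.0):
    """Rational quadratic kernel as a function of distance r = ||x - x'||."""
    return sigma2 * (1.0 + r ** 2 / (2.0 * alpha * l ** 2)) ** (-alpha)

def se_kernel(r, sigma2=1.0, l=1.0):
    """Squared exponential kernel as a function of distance r."""
    return sigma2 * np.exp(-r ** 2 / (2.0 * l ** 2))

# As alpha grows, the RQ kernel approaches the SE kernel pointwise.
r = np.linspace(0.0, 3.0, 101)
gaps = [np.max(np.abs(rq_kernel(r, alpha=a) - se_kernel(r)))
        for a in (1.0, 10.0, 1000.0)]
```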

## Matérn covariance functions

• Named after Swedish statistician Bertil Matérn.
• Has the form \begin{aligned} K(x, x') = \sigma^2 \frac{2^{1-\nu}}{\Gamma (\nu)} \left( \frac{\sqrt{2\nu} \|x-x'\|}{l}\right)^\nu K_\nu \left( \frac{\sqrt{2\nu} \|x-x'\|}{l} \right) \end{aligned}, where $\Gamma$ is the gamma function and $K_\nu$ is the modified Bessel function of the second kind.
• The hyperparameters are $\sigma \geq 0$, $l > 0$ and $\nu > 0$.
• The functions drawn from this process are $\lceil \nu \rceil - 1$ times differentiable.
• The larger $\nu$ is, the smoother the functions drawn from this process. As $\nu \rightarrow \infty$, this kernel converges to the SE kernel.
• When $\nu = p + 1/2$ for some integer $p$, the kernel can be written as a product of an exponential and a polynomial of order $p$. For this reason, the values $\nu = 1/2$, $\nu = 3/2$ and $\nu = 5/2$ are commonly used. The latter two are more popular as the samples from $\nu = 1/2$ are often thought to be too “rough”. (Rasmussen & Williams make the case that it is hard to distinguish between values of $\nu \geq 7/2$ and $\nu = \infty$.)
• This kernel is stationary and isotropic.
• When $\nu = 1/2$, the resulting kernel is known as the exponential covariance function. If we further restrict $\mathbb{I} = \mathbb{R}$, the resulting Gaussian process is the Ornstein-Uhlenbeck process.
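For the three common half-integer values of $\nu$, the exponential-times-polynomial form mentioned above avoids Bessel functions entirely. A sketch of the closed forms (the function name is my own):

```python
import numpy as np

def matern_kernel(r, nu, sigma2=1.0, l=1.0):
    """Matérn kernel at distance r = ||x - x'|| for the common
    half-integer smoothness values, in closed form (no Bessel functions)."""
    r = np.asarray(r, dtype=float)
    if nu == 0.5:                        # exponential covariance function
        return sigma2 * np.exp(-r / l)
    if nu == 1.5:
        a = np.sqrt(3.0) * r / l
        return sigma2 * (1.0 + a) * np.exp(-a)
    if nu == 2.5:
        a = np.sqrt(5.0) * r / l
        return sigma2 * (1.0 + a + a ** 2 / 3.0) * np.exp(-a)
    raise ValueError("only nu in {0.5, 1.5, 2.5} are implemented here")
```

At a fixed distance, the covariance increases with $\nu$, reflecting the smoother sample paths.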

## Periodic kernel

• Has the form \begin{aligned} K(x, x') = \sigma^2 \exp \left[ - \frac{2 \sin^2 (\pi \| x - x'\| / p) }{l^2} \right] \end{aligned}, where $\sigma \geq 0$, $l > 0$ and $p > 0$ are hyperparameters.
• $p$ is the period of the function, determining the distance between repetitions of the function.
• Good for modeling functions which repeat themselves exactly.
• Sometimes functions repeat themselves only approximately, not exactly. In this situation, we can use the product of the periodic kernel with another kernel (the product of two kernels is itself a kernel). Such kernels are known as locally periodic kernels.
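A sketch of the periodic kernel and a locally periodic variant built as its product with an SE kernel (function names and the choice of SE factor are my own; other decaying kernels work too):

```python
import numpy as np

def periodic_kernel(r, sigma2=1.0, l=1.0, p=1.0):
    """Periodic kernel as a function of distance r."""
    return sigma2 * np.exp(-2.0 * np.sin(np.pi * r / p) ** 2 / l ** 2)

def locally_periodic_kernel(r, sigma2=1.0, l=1.0, p=1.0, l_se=3.0):
    """Product of the periodic kernel and an SE kernel: the pattern
    repeats, but correlation decays slowly with distance."""
    return periodic_kernel(r, sigma2, l, p) * np.exp(-r ** 2 / (2.0 * l_se ** 2))

# At whole multiples of the period, the plain periodic kernel returns
# to sigma2 exactly, while the locally periodic kernel's peaks decay.
k_per = periodic_kernel(np.array([0.0, 2.0, 4.0]))
k_loc = locally_periodic_kernel(np.array([0.0, 2.0, 4.0]))
```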

## Linear/polynomial kernel

• The linear kernel has the form \begin{aligned} K(x, x') = x^T x' + \sigma^2 \end{aligned}, where $\sigma \geq 0$ is a hyperparameter.
• The polynomial kernel generalizes the linear kernel: \begin{aligned} K(x, x') = (x^T x' + \sigma^2)^d \end{aligned}, where $d \in \mathbb{N}$ is the degree of the polynomial, usually taken to be 2.
• It is a nonstationary kernel.
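A one-line sketch (the function name is my own). Nonstationarity is visible directly: the value depends on $x$ and $x'$ individually, not just on their difference, so translating both inputs changes the kernel value.

```python
import numpy as np

def polynomial_kernel(x, xp, sigma2=1.0, d=2):
    """Polynomial kernel; d = 1 recovers the linear kernel (plus offset)."""
    return (np.dot(x, xp) + sigma2) ** d

x = np.array([1.0, 2.0])
xp = np.array([0.5, -1.0])
k = polynomial_kernel(x, xp, sigma2=1.0, d=2)  # (0.5 - 2 + 1)^2 = 0.25
# Translating both inputs by a constant changes the value: nonstationary.
k_shifted = polynomial_kernel(x + 1.0, xp + 1.0, sigma2=1.0, d=2)
```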

## Brownian motion

• Brownian motion is a one-dimensional Gaussian process on $\mathbb{I} = [0, \infty)$ with mean zero and covariance function $K(x, x') = \min (x, x')$.
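Since the finite-dimensional distributions are Gaussian, a Brownian path can be sampled on a grid directly from the $\min(s, t)$ covariance. A minimal sketch (grid and jitter choices are mine):

```python
import numpy as np

# Sample a Brownian motion path from its finite-dimensional Gaussian
# distribution: mean zero, covariance K(s, t) = min(s, t).
rng = np.random.default_rng(1)
t = np.linspace(0.005, 1.0, 200)           # grid in (0, 1]; B_0 = 0 is fixed
K = np.minimum(t[:, None], t[None, :])     # Gram matrix of min(s, t)
# A tiny diagonal jitter guards against numerical non-PSD-ness.
path = rng.multivariate_normal(np.zeros(t.size), K + 1e-10 * np.eye(t.size))
```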

See this webpage for a longer list of kernels.

References:

1. Duvenaud, D. The Kernel Cookbook: Advice on Covariance functions.
2. Rasmussen, C. E., and Williams, C. K. I. (2006). Gaussian processes for machine learning. Chapter 4: Covariance Functions.
3. Snelson, E. (2006). Tutorial: Gaussian process models for machine learning.
4. Wikipedia. Matérn covariance function.