What is a dgCMatrix object made of? (sparse matrix format in R)

I’ve been working with sparse matrices in R recently (those created using Matrix::Matrix with the option sparse=TRUE) and found it difficult to track down documentation about what the slots in the matrix object are. This post describes the slots in a class dgCMatrix object.

(Click here for full documentation of the Matrix package (and it is a lot–like, 215 pages a lot).)

Background

It turns out that there is some documentation on dgCMatrix objects within the Matrix package. One can access it using the following code:

library(Matrix)
?`dgCMatrix-class`

According to the documentation, the dgCMatrix class

…is a class of sparse numeric matrices in the compressed, sparse, column-oriented format. In this implementation the non-zero elements in the columns are sorted into increasing row order. dgCMatrix is the “standard” class for sparse numeric matrices in the Matrix package.

An example

We’ll use a small matrix as a running example in this post:

library(Matrix)
M <- Matrix(c(0, 0,  0, 2,
              6, 0, -1, 5,
              0, 4,  3, 0,
              0, 0,  5, 0),
            byrow = TRUE, nrow = 4, sparse = TRUE)
rownames(M) <- paste0("r", 1:4)
colnames(M) <- paste0("c", 1:4)
M
# 4 x 4 sparse Matrix of class "dgCMatrix"
#    c1 c2 c3 c4
# r1  .  .  .  2
# r2  6  . -1  5
# r3  .  4  3  .
# r4  .  .  5  .

Running str on M tells us that the dgCMatrix object has 6 slots. (To learn more about slots and S4 objects, see this section from Hadley Wickham’s Advanced R.)

str(M)
# Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
# ..@ i       : int [1:7] 1 2 1 2 3 0 1
# ..@ p       : int [1:5] 0 1 2 5 7
# ..@ Dim     : int [1:2] 4 4
# ..@ Dimnames:List of 2
# .. ..$ : chr [1:4] "r1" "r2" "r3" "r4"
# .. ..$ : chr [1:4] "c1" "c2" "c3" "c4"
# ..@ x       : num [1:7] 6 4 -1 3 5 2 5
# ..@ factors : list()

x, i and p

If a matrix M has n non-zero entries, then its x slot is a vector of length n containing all the non-zero values in the matrix. The non-zero elements in column 1 are listed first (starting from the top and ending at the bottom), followed by column 2, 3 and so on.

M
# 4 x 4 sparse Matrix of class "dgCMatrix"
#    c1 c2 c3 c4
# r1  .  .  .  2
# r2  6  . -1  5
# r3  .  4  3  .
# r4  .  .  5  .

M@x
# [1]  6  4 -1  3  5  2  5

as.numeric(M)[as.numeric(M) != 0]
# [1]  6  4 -1  3  5  2  5

The i slot is a vector of length n. The kth element of M@i is the row index of the kth non-zero element (as listed in M@x). One big thing to note here is that the first row has index ZERO, unlike R’s indexing convention. In our example, the first non-zero entry, 6, is in the second row, i.e. row index 1, so the first entry of M@i is 1.

M
# 4 x 4 sparse Matrix of class "dgCMatrix"
#    c1 c2 c3 c4
# r1  .  .  .  2
# r2  6  . -1  5
# r3  .  4  3  .
# r4  .  .  5  .

M@i
# [1] 1 2 1 2 3 0 1

If the matrix has nvars columns, then the p slot is a vector of length nvars + 1. If we index the columns such that the first column has index ZERO, then M@p[1] = 0, and M@p[j+2] - M@p[j+1] gives us the number of non-zero elements in column j.

In our example, when j = 2, M@p[2+2] - M@p[2+1] = 5 - 2 = 3, so there are 3 non-zero elements in column index 2 (i.e. the third column).

M
# 4 x 4 sparse Matrix of class "dgCMatrix"
#    c1 c2 c3 c4
# r1  .  .  .  2
# r2  6  . -1  5
# r3  .  4  3  .
# r4  .  .  5  .

M@p
# [1] 0 1 2 5 7
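
Note that taking successive differences of M@p recovers the number of non-zero entries in each column:

diff(M@p)
# [1] 1 1 3 2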

With the x, i and p slots, one can reconstruct the entries of the matrix.
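
For instance, we can rebuild M from these slots (plus Dim and Dimnames) using Matrix::sparseMatrix. A quick sketch; index1 = FALSE tells sparseMatrix that the i indices are zero-based, matching the slot:

M2 <- sparseMatrix(i = M@i, p = M@p, x = M@x,
                   dims = M@Dim, dimnames = M@Dimnames,
                   index1 = FALSE)
all(as.matrix(M2) == as.matrix(M))  # TRUE: M is recovered exactly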

Dim and Dimnames

These two slots are fairly obvious. Dim is a vector of length 2, with the first and second entries denoting the number of rows and columns the matrix has respectively. Dimnames is a list of length 2: the first element being a vector of row names (if present) and the second being a vector of column names (if present).

factors

This slot is probably the most unusual of the lot, and its documentation was a bit difficult to track down. From the CRAN documentation, it looks like factors is

… [an] Object of class “list” – a list of factorizations of the matrix. Note that this is typically empty, i.e., list(), initially and is updated automagically whenever a matrix factorization is computed.

My understanding is if we perform any matrix factorizations or decompositions on a dgCMatrix object, it stores the factorization under factors so that if asked for the factorization again, it can return the cached value instead of recomputing the factorization. Here is an example:

M@factors
# list()

Mlu <- lu(M)  # perform triangular decomposition
str(M@factors)
# List of 1
# $ LU:Formal class 'sparseLU' [package "Matrix"] with 5 slots
# .. ..@ L  :Formal class 'dtCMatrix' [package "Matrix"] with 7 slots
# .. .. .. ..@ i       : int [1:4] 0 1 2 3
# .. .. .. ..@ p       : int [1:5] 0 1 2 3 4
# .. .. .. ..@ Dim     : int [1:2] 4 4
# .. .. .. ..@ Dimnames:List of 2
# .. .. .. .. ..$ : chr [1:4] "r2" "r3" "r4" "r1"
# .. .. .. .. ..$ : NULL
# .. .. .. ..@ x       : num [1:4] 1 1 1 1
# .. .. .. ..@ uplo    : chr "U"
# .. .. .. ..@ diag    : chr "N"
# .. ..@ U  :Formal class 'dtCMatrix' [package "Matrix"] with 7 slots
# .. .. .. ..@ i       : int [1:7] 0 1 0 1 2 0 3
# .. .. .. ..@ p       : int [1:5] 0 1 2 5 7
# .. .. .. ..@ Dim     : int [1:2] 4 4
# .. .. .. ..@ Dimnames:List of 2
# .. .. .. .. ..$ : NULL
# .. .. .. .. ..$ : chr [1:4] "c1" "c2" "c3" "c4"
# .. .. .. ..@ x       : num [1:7] 6 4 -1 3 5 5 2
# .. .. .. ..@ uplo    : chr "U"
# .. .. .. ..@ diag    : chr "N"
# .. ..@ p  : int [1:4] 1 2 3 0
# .. ..@ q  : int [1:4] 0 1 2 3
# .. ..@ Dim: int [1:2] 4 4

Here is an example which shows that the decomposition is only performed once:

set.seed(1)
M <- runif(9e6)
M[sample.int(9e6, size = 8e6)] <- 0
M <- Matrix(M, nrow = 3e3, sparse = TRUE)

system.time(lu(M))
#   user  system elapsed
# 13.527   0.161  13.701

system.time(lu(M))
#   user  system elapsed
#      0       0       0

Equicorrelation matrix

The equicorrelation matrix is the \mathbb{R}^{n \times n} matrix where the entries on the diagonal are all equal to 1 and all off-diagonal entries are equal to some parameter \rho which lies in [-1, 1]. If we were to write out the matrix, it would look something like this:

{\bf \Sigma} = \begin{pmatrix} 1 & \rho & \rho & \dots & \rho \\ \rho & 1 & \rho & \dots & \rho \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \rho &\dots & 1 \end{pmatrix}

Alternatively, we can write it as {\bf \Sigma} = \rho {\bf 11}^T + (1-\rho){\bf I}, where {\bf 1} \in \mathbb{R}^n is the column vector with all entries being 1 and \bf I is the identity matrix.

Here are some useful properties of the equicorrelation matrix:

  • It is a Toeplitz matrix, and hence has the properties that all Toeplitz matrices have (see e.g. this link).
  • It has two distinct eigenvalues. The first eigenvalue is 1 + (n-1)\rho, with associated eigenvector v_1 = {\bf 1}. The second eigenvalue is 1 - \rho, with n-1 associated eigenvectors v_2, \dots, v_n, where the entries of v_k are \begin{aligned} (v_k)_j = \begin{cases} 1 &\text{if } j = 1, \\ -1 &\text{if } j = k, \\ 0 &\text{otherwise.} \end{cases} \end{aligned} This can be verified directly by doing the matrix multiplication.
  • \text{det} ({\bf \Sigma}) = (1-\rho)^{n-1}[1 + (n-1)\rho]. This is because the determinant of a square matrix is equal to the product of its eigenvalues.
  • \bf \Sigma is positive definite if and only if -\frac{1}{n-1} < \rho < 1. A sketch of the proof can be found in the answer here. It boils down to proving some properties of the determinant expression in the previous point.
  • {\bf \Sigma}^{-1} = \dfrac{1}{1-\rho} \left( {\bf I} - \dfrac{\rho}{1 + (n-1)\rho} {\bf 11}^T \right). This can be verified directly by matrix multiplication. It can also be derived using the Sherman-Morrison formula.
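
These properties are easy to check numerically. Here is a quick sketch in R (my own illustration; the values of n and rho are arbitrary):

n <- 5; rho <- 0.3
Sigma <- rho * matrix(1, n, n) + (1 - rho) * diag(n)

# Eigenvalues: 1 + (n-1)*rho once, and 1 - rho with multiplicity n-1
eigen(Sigma, symmetric = TRUE)$values

# Determinant matches (1-rho)^(n-1) * [1 + (n-1)*rho]
c(det(Sigma), (1 - rho)^(n - 1) * (1 + (n - 1) * rho))

# Inverse matches the closed-form expression
Sigma_inv <- (diag(n) - rho / (1 + (n - 1) * rho) * matrix(1, n, n)) / (1 - rho)
max(abs(Sigma_inv - solve(Sigma)))  # numerically zero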

What is a diagonally dominant matrix?

A diagonally dominant matrix is a square matrix A \in \mathbb{R}^{n \times n} such that for each i = 1, \dots, n,

\begin{aligned} |A_{ii}| \geq \sum_{j \neq i} |A_{ij}|. \end{aligned}

It is said to be strictly diagonally dominant if the inequality above is strict for all values of i.

In words, a diagonally dominant matrix is a square matrix such that in each row, the absolute value of the term on the diagonal is greater than or equal to the sum of absolute values of the rest of the terms in that row.
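
For example, the check is one comparison per row; a small helper function (my own sketch, not from the references below) might look like this:

is_diag_dominant <- function(A, strict = FALSE) {
  # For each row, compare |A_ii| with the sum of |A_ij| over j != i
  d <- abs(diag(A))
  off <- rowSums(abs(A)) - d
  if (strict) all(d > off) else all(d >= off)
}

A <- matrix(c( 3, -1,  1,
               1,  4,  2,
              -1,  0,  2), nrow = 3, byrow = TRUE)
is_diag_dominant(A)                 # TRUE
is_diag_dominant(A, strict = TRUE)  # TRUE: every row is strictly dominant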

Properties

  1. A strictly diagonally dominant matrix is non-singular, i.e. has an inverse. This is known as the Levy-Desplanques theorem; a proof of the theorem can be found here.
  2. A symmetric diagonally dominant real matrix with non-negative diagonal entries is positive semidefinite (PSD). A proof can be found in the Wikipedia article (Reference 2). Similarly, a symmetric strictly diagonally dominant real matrix with positive diagonal entries is positive definite.

(Note: Diagonally dominant matrices can be defined for matrices with complex entries as well. The references and links in this post apply to matrices with complex entries; here I just focus on matrices with real entries.)

References:

  1. Wolfram MathWorld. Diagonally Dominant Matrix.
  2. Wikipedia. Diagonally dominant matrix.

Properties of covariance matrices

This post lists some properties of covariance matrices that I often forget.

A covariance matrix \Sigma \in \mathbb{R}^{n \times n} is simply a matrix such that there exists some random vector X \in \mathbb{R}^n such that \sigma_{ij} = \text{Cov}(X_i, X_j) for all i and j.

Properties:

  1. \Sigma is symmetric since \text{Cov}(X_i, X_j) = \text{Cov}(X_j, X_i).
  2. \Sigma is positive semi-definite (PSD):
    \begin{aligned} u^T \Sigma u &= \sum_{i, j=1}^n u_i \Sigma_{ij}u_j \\  &= \sum_{i, j=1}^n \text{Cov}(u_i X_i, u_j X_j) \\ &= \text{Cov}\left( \sum_i u_i X_i, \sum_j u_j X_j \right) \geq 0. \end{aligned}
  3. Because \Sigma is PSD, all of its eigenvalues are non-negative. (If \Sigma was positive definite, then its eigenvalues would be positive.)
  4. Since \Sigma is real and symmetric, all of its eigenvalues are real, and there exists a real orthogonal matrix Q such that D = Q^T \Sigma Q is a diagonal matrix. (The entries along the diagonal of D are \Sigma‘s eigenvalues.)
  5. Since \Sigma‘s eigenvalues are all non-negative, \Sigma has a symmetric square root \Sigma^{1/2} = QD^{1/2}Q^T, which satisfies \Sigma^{1/2}\Sigma^{1/2} = \Sigma. (If \Sigma is positive definite, then it also has an inverse square root: \Sigma^{-1/2} = QD^{-1/2}Q^T.)

Note that the above apply to sample covariance matrices as well.
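
A quick numerical illustration of these properties with a (simulated) sample covariance matrix:

set.seed(1)
X <- matrix(rnorm(200 * 4), nrow = 200)
S <- cov(X)                           # sample covariance matrix

isSymmetric(S)                        # property 1
eig <- eigen(S, symmetric = TRUE)
min(eig$values) >= 0                  # properties 2 and 3: eigenvalues are non-negative

# Property 5: symmetric square root S^{1/2} = Q D^{1/2} Q^T
S_half <- eig$vectors %*% diag(sqrt(eig$values)) %*% t(eig$vectors)
max(abs(S_half %*% S_half - S))       # numerically zero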

References:

  1. Wikipedia. Covariance matrix.
  2. Statistical Odds & Ends. Properties of real symmetric matrices.

Multivariate differentiation: simple rules

In this previous post, we defined the Jacobian for a multivariate function f: \mathbb{R}^n \mapsto \mathbb{R}^m which maps \mathbf{x} = (x_1, \dots, x_n)^T \in \mathbb{R}^n to f(\mathbf{x}) \in \mathbb{R}^m. It is the multivariate analog of the derivative, and if all the partial derivatives exist, is simply the m \times n matrix

J_f = \begin{pmatrix} \dfrac{\partial f_1}{\partial x_1} & \dots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \dots & \dfrac{\partial f_m}{\partial x_n} \end{pmatrix}.

In the previous post, we also presented the chain rule, product rule and dot product rule. In this post, I’ll present some Jacobian formulas/rules for simple functions. In what follows, the only variable is \mathbf{x}: all other symbols denote constants.

The Jacobian for linear operators and quadratic forms have easy formulas which can be derived by explicitly computing each partial derivative:

Rule 1: If f(\mathbf{x}) = \mathbf{Ax}, where \mathbf{A} \in \mathbb{R}^{m \times n}, then J_f (\mathbf{x}) = \mathbf{A}.

Rule 2: If f(\mathbf{x}) = \mathbf{x^T Ax}, where \mathbf{A} \in \mathbb{R}^{n \times n}, then J_f (\mathbf{x}) = \mathbf{x^T(A + A^T)}.
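
Both rules are easy to spot-check with a crude forward-difference Jacobian (a quick sketch; agreement is only up to finite-difference error):

num_jacobian <- function(f, x, eps = 1e-6) {
  # Forward-difference approximation, one column per input coordinate
  J <- sapply(seq_along(x), function(j) {
    e <- numeric(length(x)); e[j] <- eps
    (f(x + e) - f(x)) / eps
  })
  matrix(J, nrow = length(f(x)))
}

set.seed(1)
A <- matrix(rnorm(9), 3, 3)
x <- rnorm(3)

# Rule 1: Jacobian of x -> Ax is A
max(abs(num_jacobian(function(x) as.numeric(A %*% x), x) - A))                        # close to zero

# Rule 2: Jacobian of x -> x^T A x is x^T (A + A^T)
max(abs(num_jacobian(function(x) as.numeric(t(x) %*% A %*% x), x) - t(x) %*% (A + t(A))))  # close to zero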

Let’s use the rules here and in the previous post to derive the ordinary least squares solution. In that setting, let \mathbf{y} \in \mathbb{R}^n, and let \mathbf{X} \in \mathbb{R}^{n \times p} with n > p have full column rank. We wish to find \beta \in \mathbb{R}^p which minimizes

f(\beta) = \| \mathbf{y} - \mathbf{X}\beta \|_2^2 = (\mathbf{y} - \mathbf{X}\beta) \cdot (\mathbf{y} - \mathbf{X}\beta).

Letting g(\beta) = \mathbf{y} - \mathbf{X}\beta, Rule 1 tells us that J_g(\beta) = (-\mathbf{X}). Using the dot-product rule,

\begin{aligned} \nabla f &= \nabla (g \cdot g) \\  &= J_g^T g + J_g^T g \\  &= 2(-\mathbf{X})^T (\mathbf{y} - \mathbf{X}\beta) \\  &=-2\mathbf{X}^T\mathbf{y} + 2\mathbf{X^TX}\beta. \end{aligned}

Setting it equal to zero and noting that \mathbf{X^T X} is positive definite (proof here) and hence has an inverse, we can rearrange to obtain the solution \hat{\beta} = (\mathbf{X^T X})^{-1} \mathbf{X}^T\mathbf{y}. Using Rule 1 again, the Hessian \nabla^2 f = 2\mathbf{X^TX} \succ 0 is positive definite for all \beta, and so \hat{\beta} is a local minimum. It’s the only stationary point, and so it is a global minimum as well.
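
As a sanity check, solving the normal equations directly matches what lm() computes on simulated data (a quick sketch; the model is fit without an intercept so the two parameterizations agree):

set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(1, -2, 0.5) + rnorm(n)

beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X^T X)^{-1} X^T y
cbind(beta_hat, coef(lm(y ~ X - 1)))       # the two columns agree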

References:

  1. Matrix Differentiation. Randal J. Barnes.

Multivariate differentiation

Let f: \mathbb{R}^n \mapsto \mathbb{R}^m be a function which maps \mathbf{x} = (x_1, \dots, x_n)^T \in \mathbb{R}^n to f(\mathbf{x}) \in \mathbb{R}^m. For i=1, \dots m, let f_i denote the ith coordinate of f, i.e.

f(\mathbf{x}) = \begin{pmatrix} f_1(\mathbf{x}) \\ \vdots \\ f_m (\mathbf{x}) \end{pmatrix}.

The Jacobian matrix of f, denoted J_f, is the m \times n matrix of derivatives:

J_f = \begin{pmatrix} \dfrac{\partial f_1}{\partial x_1} & \dots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \dots & \dfrac{\partial f_m}{\partial x_n} \end{pmatrix},

or more succinctly, (J_f)_{ij} = \dfrac{\partial f_i}{\partial x_j} for all i and j. The number of rows is equal to the number of coordinate functions f_i while the number of columns is equal to the number of variables x_j. J_f can also be thought of as a function from \mathbb{R}^n to \mathbb{R}^{m \times n}: for each point \mathbf{x} \in \mathbb{R}^n we have a corresponding Jacobian matrix J_f(\mathbf{x}).

The Jacobian is the multivariate equivalent of a derivative: if f is differentiable at \mathbf{x}, then the mapping u \mapsto J_f(\mathbf{x}) u (which goes from \mathbb{R}^n to \mathbb{R}^m) is the best pointwise linear approximation of the function f at \mathbf{x}.

The chain rule holds for these multivariate derivatives. Let f: \mathbb{R}^n \mapsto \mathbb{R}^m and g: \mathbb{R}^p \mapsto \mathbb{R}^n. Then

Chain rule: J_{f \circ g}(\mathbf{x}) = J_f [g(\mathbf{x})] J_g (\mathbf{x}).

The special case of m = 1

When m = 1, i.e. f : \mathbb{R}^n \mapsto \mathbb{R}, the Jacobian J_f is a 1 \times n matrix, or a row vector. Since we are more accustomed to talking about vectors as columns, we define the gradient of f as

\nabla f = (J_f)^T \in \mathbb{R}^n.

The gradient of f is itself a mapping from \mathbb{R}^n to \mathbb{R}^n, so we can talk about its Jacobian as well (without having to resort to tensors). The Jacobian of the gradient is known as the Hessian. If all of f‘s second partial derivatives exist and are continuous over the domain of the function, then the Hessian matrix is given by

H = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1\partial x_2} &\dots & \frac{\partial^2 f}{\partial x_1\partial x_n} \\  \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} &\dots & \frac{\partial^2 f}{\partial x_2\partial x_n} \\  \vdots & \vdots &\ddots &\vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} &\dots & \frac{\partial^2 f}{\partial x_n^2}  \end{pmatrix}.

If f: \mathbb{R}^n \mapsto \mathbb{R} and g: \mathbb{R}^n \mapsto \mathbb{R}, we can talk about a multivariate extension of the product rule. Let fg be defined as the point-wise product of f and g, i.e. fg(\mathbf{x}) = f(\mathbf{x}) g(\mathbf{x}). If we define

p \left( \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \right) = x_1 x_2    and    q(\mathbf{x}) = \begin{pmatrix} f(\mathbf{x}) \\ g(\mathbf{x}) \end{pmatrix},

then fg(\mathbf{x}) = p (q (\mathbf{x})). This allows us to apply the multivariate chain rule and obtain

\begin{aligned} J_{fg}(\mathbf{x}) = J_{p \circ q}(\mathbf{x}) &= J_p [q(\mathbf{x})] J_q (\mathbf{x}) \\  &= \begin{pmatrix} g(\mathbf{x}) & f(\mathbf{x}) \end{pmatrix} \begin{pmatrix} \nabla f (\mathbf{x})^T \\ \nabla g(\mathbf{x})^T \end{pmatrix} \\  &= g(\mathbf{x}) \nabla f (\mathbf{x})^T + f(\mathbf{x}) \nabla g (\mathbf{x})^T.  \end{aligned}

Taking transposes, we get

Product rule: \nabla (fg) (\mathbf{x}) =g(\mathbf{x}) \nabla f (\mathbf{x}) + f(\mathbf{x}) \nabla g (\mathbf{x}).

A dot-product rule

Consider the setting f: \mathbb{R}^n \mapsto \mathbb{R}^m and g: \mathbb{R}^n \mapsto \mathbb{R}^m. The dot product mapping from \mathbb{R}^n to \mathbb{R} is well defined: (f \cdot g) (\mathbf{x}) = f(\mathbf{x}) \cdot g(\mathbf{x}) =f(\mathbf{x})^T g(\mathbf{x}). Applying the product rule above by coordinate,

\begin{aligned} \nabla (f \cdot g)(\mathbf{x}) &= \sum_{i=1}^m \nabla [f_i(\mathbf{x}) g_i(\mathbf{x})] \\  &= \sum_{i=1}^m g_i(\mathbf{x}) \nabla f_i (\mathbf{x}) + f_i(\mathbf{x}) \nabla g_i (\mathbf{x}) \\  &= \begin{pmatrix} | & & | \\ \nabla f_1 &\dots & \nabla f_m \\ | & & | \end{pmatrix} g(\mathbf{x}) + \begin{pmatrix} | & & | \\ \nabla g_1 &\dots & \nabla g_m \\ | & & | \end{pmatrix} f(\mathbf{x}), \end{aligned}.

that is,

Dot-product rule: \nabla (f \cdot g)(\mathbf{x}) =J_f(\mathbf{x})^T g(\mathbf{x}) + J_g(\mathbf{x})^T f(\mathbf{x}).
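
Here is a quick numerical check of the dot-product rule (my own sketch), using linear maps f(\mathbf{x}) = A\mathbf{x} and g(\mathbf{x}) = B\mathbf{x} so that J_f = A and J_g = B:

num_grad <- function(h, x, eps = 1e-6) {
  # Forward-difference gradient of a scalar-valued function h
  sapply(seq_along(x), function(j) {
    e <- numeric(length(x)); e[j] <- eps
    (h(x + e) - h(x)) / eps
  })
}

set.seed(1)
A <- matrix(rnorm(6), 2, 3)
B <- matrix(rnorm(6), 2, 3)
x <- rnorm(3)

lhs <- num_grad(function(x) sum((A %*% x) * (B %*% x)), x)  # grad of (f . g) at x
rhs <- t(A) %*% (B %*% x) + t(B) %*% (A %*% x)              # J_f^T g(x) + J_g^T f(x)
max(abs(lhs - rhs))                                         # close to zero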

A coordinate-wise rule

Let f: \mathbb{R} \mapsto \mathbb{R}. Let \tilde{f}:\mathbb{R}^n \mapsto \mathbb{R}^n be the function where f is applied to each coordinate of \mathbf{x}, i.e.

\tilde{f}(\mathbf{x}) = \begin{pmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{pmatrix}.

(Sometimes we “overload” the symbol f to denote both f and \tilde{f}, since what is meant can be deduced from context.) The Jacobian of \tilde{f} can be easily determined by computing the partial derivatives:

\begin{aligned} \dfrac{\partial \tilde{f}_i}{\partial x_j} = \dfrac{\partial f(x_i)}{\partial x_j} = \begin{cases} f'(x_i) &\text{if } i = j, \\ 0 &\text{if } i \neq j. \end{cases} \end{aligned}

The Jacobian is a diagonal matrix! Using the overloaded notation, we have

Coordinate-wise rule: J_{\tilde{f}}(\mathbf{x}) = \text{diag}[f'(\mathbf{x})].
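
For example, with f = \sin applied coordinate-wise (which is exactly what sin(x) does in R for a vector x), the Jacobian should be \text{diag}[\cos(\mathbf{x})]. A quick finite-difference check:

x <- c(0.1, 0.5, 1.0)
eps <- 1e-6
# Forward-difference Jacobian of x -> sin(x), one column per coordinate
J_num <- sapply(seq_along(x), function(j) {
  e <- numeric(length(x)); e[j] <- eps
  (sin(x + e) - sin(x)) / eps
})
max(abs(J_num - diag(cos(x))))  # close to zero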

References:

  1. Hessian matrix. Wikipedia.
  2. Jacobian matrix. Wikipedia.

Pseudoinverses and linear systems

In the previous post, I defined the matrix pseudoinverse and showed how to construct it. In this post, I will explain how pseudoinverses are useful for solving linear systems. By linear system, I mean the set of equations represented by

\begin{aligned} Ax = b, \end{aligned}

where A \in \mathbb{R}^{m \times n} is a known matrix, b \in \mathbb{R}^m is a known vector, and x \in \mathbb{R}^n is the variable we wish to solve for. Recall that there may be 0, 1 or an infinite number of solutions to this system of equations. It turns out that the “best” solution (“best” to be defined) can be expressed with a pseudoinverse. The following proposition makes this precise:

Proposition: Let z = A^+ b. Then:

  1. For all x \in \mathbb{R}^n, we have \| Ax - b\|_2 \geq \| Az - b\|_2. In other words, z = A^+ b is the least-squares solution of the linear system.
  2. If Ax = b is solvable, then z = A^+ b is a solution. In fact, for any other solution x, we have \| z \|_2 \leq \|x \|_2, i.e. z is the minimum \ell_2-norm solution.

Proof of proposition: Recall the 4 properties of the pseudoinverse:

  1. AA^+ A = A
  2. A^+ A A^+ = A^+
  3. (AA^+)^T = AA^+
  4. (A^+ A)^T = A^+ A

Using these properties, we have

\begin{aligned} A^T (Az - b) &= A^T (AA^+ b - b) \\  &= A^T (AA^+)^T b - A^T b &\text{(by condition 3)} \\  &= (AA^+ A)^T b - A^T b = A^T b - A^Tb &\text{(by condition 1)} \\  &= 0. \end{aligned}

Hence, for all x \in \mathbb{R}^n,

\begin{aligned} \|Ax - b\|_2^2 &= \| A(x - z) + Az - b \|_2^2 \\  &= \|A(x-z)\|_2^2 + \|Az - b\|_2^2 + 2 [A(x-z)]^T (Az - b) \\  &=\|A(x-z)\|_2^2 + \|Az - b\|_2^2 + 2(x-z)^T [A^T (Az - b)] \\  &= \|A(x-z)\|_2^2 + \|Az - b\|_2^2 \\  &\geq \|Az - b\|_2^2. \end{aligned}

This establishes part 1 of the proposition.

The first statement of part 2 follows immediately: since the system is solvable, there is some x^* such that Ax^* = b, implying that 0 \leq \|Az -b\|_2 \leq \|Ax^* - b\|_2 = 0, i.e. z is also a solution. To show that z is the minimum-norm solution: Let x be any solution to the linear system. Note that

\begin{aligned} z^T (x - z) &= (A^+ A A^+ b)^T (x - z) &\text{(by condition 2)} \\  &= (A^+ b)^T A^+ A (x - A^+ b) &\text{(by condition 4)} \\  &= (A^+ b)^T (A^+ b - A^+ b) &\text{(since } Ax = b \text{, and by condition 2)} \\  &= 0. \end{aligned}

Hence,

\begin{aligned} \|x\|_2^2 &= \|z + (x-z) \|_2^2 \\  &= \|z\|_2^2 + \|x-z\|_2^2 + 2 z^T (x-z) \\  &=\|z\|_2^2 + \|x-z\|_2^2 + 0 \\  &\geq \|z\|_2^2. \end{aligned}

This completes the proof of the proposition. \blacksquare

How is this proposition relevant to statistics? Consider the linear regression problem, where we have response y \in \mathbb{R}^n which we wish to predict using linear regression with a data matrix X \in \mathbb{R}^{n \times p}. With the least squares criterion, this amounts to

\underset{\beta \in \mathbb{R}^p}{\text{minimize}} \quad \| y - X\beta\|_2^2.

We can solve this by taking the derivative w.r.t. \beta and setting it to zero, i.e.

\begin{aligned} 2 (-X)^T (y - X\beta) &= 0, \\  X^T X \beta &= X^T y. \end{aligned}

If the above is solvable, then the proposition tells us that \beta = (X^T X)^+ X^T y is the solution to the least squares criterion with minimum \ell_2 norm.
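
Here is a small illustration with a rank-deficient design matrix (a sketch assuming the MASS package, whose ginv() function computes the Moore-Penrose inverse). Adding a null-space vector of X to a solution leaves the fit unchanged, but the pseudoinverse solution has the smallest \ell_2 norm:

library(MASS)  # for ginv(), the Moore-Penrose pseudoinverse

set.seed(1)
n <- 50
X <- matrix(rnorm(n * 2), n, 2)
X <- cbind(X, X[, 1] + X[, 2])       # third column is a linear combination, so rank(X) = 2
y <- rnorm(n)

beta_pinv  <- ginv(t(X) %*% X) %*% t(X) %*% y   # (X^T X)^+ X^T y
beta_other <- beta_pinv + c(1, 1, -1)           # (1, 1, -1) is in the null space of X

max(abs(X %*% beta_pinv - X %*% beta_other))    # same fitted values (numerically zero)
c(sum(beta_pinv^2), sum(beta_other^2))          # beta_pinv has the smaller norm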

References:

  1. Moore-Penrose inverse. Wikipedia.
  2. Proofs involving the Moore-Penrose inverse. Wikipedia.

Matrix pseudoinverses

There are some things that are never formally taught (in both undergrad and grad school) but are assumed to be known by grad students. In statistics, one of these things is pseudoinverses. In this post, we will define the matrix pseudoinverse, prove that it exists and that it is unique.

If A \in \mathbb{R}^{n \times n} is a square matrix such that \det A \neq 0, then it has an inverse, denoted A^{-1}, such that

\begin{aligned} A^{-1}A = A A^{-1} = I, \end{aligned}

where I is the identity matrix. There are two issues with this definition: (i) it only applies to square matrices, and (ii) it only applies when the determinant is non-zero. The pseudoinverse, also known as the Moore-Penrose inverse, is a generalization of the inverse which fixes these two issues.

Definition

Let A \in \mathbb{R}^{m \times n} be a matrix with real entries. Its pseudoinverse is defined as the unique matrix, denoted A^+ \in \mathbb{R}^{n \times m} which satisfies the following 4 properties:

  1. AA^+ A = A
  2. A^+ A A^+ = A^+
  3. (AA^+)^T = AA^+
  4. (A^+ A)^T = A^+ A

If A has the usual matrix inverse A^{-1}, it is easy to see that A^{-1} satisfies the 4 conditions above. It’s also not too hard to see that the pseudoinverse of a pseudoinverse is the original matrix, i.e. (A^+)^+ = A, and that the pseudoinverse of the transpose is the transpose of the pseudoinverse, i.e. (A^T)^+ = (A^+)^T.

Existence

A definition for the pseudoinverse does not guarantee that it exists. It turns out that for every matrix, we can construct its pseudoinverse pretty easily. Let’s first construct this for rectangular diagonal matrices D \in \mathbb{R}^{m \times n}, i.e. D_{ij} = 0 for all i \neq j.

Let D^+ \in \mathbb{R}^{n \times m} be the matrix such that

\begin{aligned} D^+_{ij} = \begin{cases} \frac{1}{D_{ij}} &\text{if } i = j \text{ and } D_{ij} \neq 0 \\ 0 &\text{otherwise.} \end{cases} \end{aligned}

It can be verified by direct matrix multiplication that the 4 pseudoinverse conditions are satisfied.

For the general matrix A \in \mathbb{R}^{m \times n}, recall that it has a singular value decomposition A = UDV^T, where U \in \mathbb{R}^{m \times m} and V \in \mathbb{R}^{n \times n} are orthogonal, and D is a rectangular diagonal matrix. If we define A^+ = VD^+ U^T, we can verify (again by direct matrix multiplication) that the 4 pseudoinverse conditions are satisfied.
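
In R, this construction takes only a few lines (a sketch using the thin SVD returned by svd(); singular values below a small tolerance are treated as zero):

set.seed(1)
A <- matrix(rnorm(12), nrow = 4, ncol = 3)

s <- svd(A)                                 # A = U diag(d) V^T
d_plus <- ifelse(s$d > 1e-10, 1 / s$d, 0)   # invert only the non-zero singular values
A_plus <- s$v %*% diag(d_plus) %*% t(s$u)   # A^+ = V D^+ U^T

# Verify the 4 pseudoinverse conditions (all differences should be numerically zero)
max(abs(A %*% A_plus %*% A - A))            # condition 1
max(abs(A_plus %*% A %*% A_plus - A_plus))  # condition 2
max(abs(t(A %*% A_plus) - A %*% A_plus))    # condition 3
max(abs(t(A_plus %*% A) - A_plus %*% A))    # condition 4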

Uniqueness

Next, we show that the pseudoinverse is unique. Let B and C be pseudoinverses for A. We have

\begin{aligned} AB &= ACAB &\text{(by condition 1)} \\  &= (AC)^T (AB)^T &\text{(by condition 3)} \\  &= C^T A^T B^T A^T = C^T (ABA)^T \\  &= C^T A^T &\text{(by condition 1)} \\  &= (AC)^T \\  &= AC. &\text{(by condition 3)} \end{aligned}

We can show analogously that BA = CA (using conditions 1 and 4). Hence,

\begin{aligned} B &= BAB &\text{(by condition 2)} \\  &= BAC = CAC \\  &= C. &\text{(by condition 2)} \end{aligned}

Hence, the pseudoinverse must be unique.

(Note: In this post I have restricted myself to matrices with real entries. Pseudoinverses work for matrices with complex entries as well, but the exposition is slightly more complicated and not relevant to the work I do in statistics.)

References:

  1. Moore-Penrose inverse. Wikipedia.
  2. Proofs involving the Moore-Penrose inverse. Wikipedia.

Matrix norms

This post describes the common matrix norms that I’ve encountered in my work in statistics. Unless otherwise stated, let X be an m \times n matrix with real-valued entries.

Operator norm

Let \| \cdot \| be some vector norm, defined on \mathbb{R}^d for every dimension d (e.g. an \ell_p norm). This norm induces an operator norm for matrices, defined by

\begin{aligned} \| X \|_{op} = \sup_{\| v \| = 1} \| X v \| = \sup_{v \neq 0} \frac{\| Xv\|}{\|v\|},  \end{aligned}

where v \in \mathbb{R}^n (so the norm in the numerator is taken in \mathbb{R}^m and the one in the denominator in \mathbb{R}^n). Note that the operator norm depends on the underlying vector norm.

When the underlying norm is the \ell_p norm, i.e. \| v \|_p = \left( \sum_i |v_i|^p \right)^{1/p}, we frequently write the operator norm as

\begin{aligned} \| X \|_p = \sup_{\| v \|_p = 1} \| X v \|_p = \sup_{v \neq 0} \frac{\| Xv\|_p}{\|v\|_p}.  \end{aligned}

When p = 2, the operator norm \| X \|_2 is simply the largest singular value of X.

Frobenius norm

The Frobenius norm of a matrix has a few equivalent definitions:

\begin{aligned} \| X \|_F &= \sqrt{\displaystyle\sum_{i=1}^m \sum_{j = 1}^n x_{ij}^2} \\ &= \sqrt{\text{tr}(X^T X)} = \sqrt{\text{tr}(XX^T)} \\ &= \sqrt{\sum_{i=1}^{\min(m,n)} \sigma_i^2(X) }, \end{aligned}

where the \sigma_i(X)‘s are the singular values of X. We can think of the Frobenius norm as an extension of the \ell_2 norm for vectors to matrices.

If X has rank r, we have the inequality

\| X \|_2 \leq \| X \|_F \leq \sqrt{r} \| X \|_2.

Nuclear norm

The nuclear norm of a matrix is given by

\begin{aligned} \| X \|_* = \text{tr}(\sqrt{X^T X}) = \sum_{i=1}^{\min (m,n)} \sigma_i(X). \end{aligned}

It is also known as the Schatten 1-norm or the trace norm.

Notice that the nuclear norm is equal to the \ell_1 norm of the singular values, while the rank of the matrix is equal to the \ell_0 norm of the singular values. This is why the nuclear norm is often used as a regularizer of objective functions of matrices. Often we want to find X which minimizes some convex objective f(X) subject to \text{rank}(X) \leq r, or in penalized (Lagrangian) form,

\text{minimize} \quad f(X) + \lambda \text{rank}(X).

This problem is non-convex in X, which makes optimization difficult. Instead of this, we could work with the optimization problem

\text{minimize} \quad f(X) + \lambda \|X\|_*.

This is a convex problem, which makes optimization much easier.

If X has rank r, we have the inequality

\| X \|_F \leq \| X \|_* \leq \sqrt{r} \| X \|_F.
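
All three norms can be computed from the singular values (base R’s norm() also gives the spectral and Frobenius norms directly); the sketch below checks the two inequalities on a random matrix:

set.seed(1)
X <- matrix(rnorm(5 * 4), 5, 4)
sv <- svd(X)$d                  # singular values of X
r <- sum(sv > 1e-10)            # rank of X

op_norm  <- max(sv)             # operator (spectral) norm, same as norm(X, "2")
fro_norm <- sqrt(sum(sv^2))     # Frobenius norm, same as norm(X, "F")
nuc_norm <- sum(sv)             # nuclear norm

op_norm <= fro_norm  && fro_norm <= sqrt(r) * op_norm    # TRUE
fro_norm <= nuc_norm && nuc_norm <= sqrt(r) * fro_norm   # TRUE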

References:

  1. Matrix norm. Wikipedia.

Matrix expansions

When doing matrix computations, I find it pretty hard to switch between the matrix form and element-wise form. This is a reference for some of the more common ones I come across. (Proofs are often just a matter of doing some tedious algebra.)

  1. The column space of an m \times n matrix A can be expressed as the set \{ Ax : x \in \mathbb{R}^n \}. We can also write Ax = \displaystyle\sum_{i = 1}^n x_i A_i, where A_i is the ith column of A.
  2. Similarly, the row space of A can be expressed as the set \{ A^Tx : x \in \mathbb{R}^m \}.
  3. If A \in \mathbb{R}^{m \times n}, x \in \mathbb{R}^m and y \in \mathbb{R}^n, then x^T A y = \displaystyle\sum_{i=1}^m\sum_{j=1}^n x_i A_{ij} y_j.
  4. If A \in \mathbb{R}^{m \times n} has columns A_1, \dots, A_n \in \mathbb{R}^{m \times 1}, and B \in \mathbb{R}^{n \times p} has rows B_1, \dots, B_n \in \mathbb{R}^{1 \times p}, then AB = \displaystyle\sum_{i=1}^n A_i B_i.
    In particular, A A^T =\displaystyle\sum_{i=1}^n A_i A_i^T, where A_i are the columns of A, and A^T A = \displaystyle\sum_{i=1}^m a_i a_i^T, where a_i are the rows of A (written as column vectors).
  5. If D is a diagonal matrix with diagonal entries d_j, U a matrix with columns U_j, and V a matrix with columns V_j, then UDV^T = \displaystyle\sum_j d_j U_j V_j^T.
  6. If A \in \mathbb{R}^{m \times n} and we want a matrix B whose ith row is c_i multiplied by the ith row of A, then B = \text{diag} (c_1, \dots, c_m) A.
    If we want a matrix B whose ith column is c_i multiplied by the ith column of A, then B =  A\text{diag} (c_1, \dots, c_n).
  7. If A and B are square matrices of the same size, then \text{tr}(A^TB) = \displaystyle\sum_{i,j} A_{ij}B_{ij}.
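
A couple of these expansions are easy to verify numerically; here is a quick sketch of items 4 and 6:

set.seed(1)
A <- matrix(rnorm(3 * 4), 3, 4)
B <- matrix(rnorm(4 * 2), 4, 2)

# Item 4: AB as a sum of (column of A) times (row of B) outer products
AB_sum <- Reduce(`+`, lapply(1:4, function(i) A[, i, drop = FALSE] %*% B[i, , drop = FALSE]))
max(abs(AB_sum - A %*% B))                  # numerically zero

# Item 6: row and column scaling with diagonal matrices
c_row <- rnorm(3); c_col <- rnorm(4)
max(abs(diag(c_row) %*% A - c_row * A))                # diag(c) A scales row i by c_i
max(abs(A %*% diag(c_col) - sweep(A, 2, c_col, `*`)))  # A diag(c) scales column j by c_j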