Equicorrelation matrix

The equicorrelation matrix is the matrix in \mathbb{R}^{n \times n} whose diagonal entries are all equal to 1 and whose off-diagonal entries are all equal to some parameter \rho \in [-1, 1]. Written out in full, the matrix looks like this:

{\bf \Sigma} = \begin{pmatrix} 1 & \rho & \rho & \dots & \rho \\ \rho & 1 & \rho & \dots & \rho \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \rho &\dots & 1 \end{pmatrix}

Alternatively, we can write it as {\bf \Sigma} = \rho {\bf 11}^T + (1-\rho){\bf I}, where {\bf 1} \in \mathbb{R}^n is the column vector with all entries being 1 and \bf I is the identity matrix.

Here are some useful properties of the equicorrelation matrix (a quick numerical check of these properties follows the list):

  • It is a Toeplitz matrix, and hence has the properties that all Toeplitz matrices have (see e.g. this link).
  • It has two distinct eigenvalues (assuming \rho \neq 0). The first eigenvalue is 1 + (n-1)\rho, with associated eigenvector v_1 = {\bf 1}. The second eigenvalue is 1 - \rho, with multiplicity n-1 and associated eigenvectors v_2, \dots, v_n, where the entries of v_k are \begin{aligned} (v_k)_j = \begin{cases} 1 &\text{if } j = 1, \\ -1 &\text{if } j = k, \\ 0 &\text{otherwise.} \end{cases} \end{aligned} This can be verified directly by doing the matrix multiplication.
  • \text{det} ({\bf \Sigma}) = (1-\rho)^{n-1}[1 + (n-1)\rho]. This is because the determinant of a square matrix is equal to the product of its eigenvalues.
  • \bf \Sigma is positive definite if and only if -\frac{1}{n-1} < \rho < 1. A sketch of the proof can be found in the answer here. It boils down to proving some properties of the determinant expression in the previous point.
  • {\bf \Sigma}^{-1} = \dfrac{1}{1-\rho} \left( {\bf I} - \dfrac{\rho}{1 + (n-1)\rho} {\bf 11}^T \right). This can be verified directly by matrix multiplication. It can also be derived using the Sherman-Morrison formula.
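
These properties are easy to check numerically. Here is a minimal sketch in NumPy (the values n = 5 and \rho = 0.3 are arbitrary choices for illustration):

```python
import numpy as np

n, rho = 5, 0.3
Sigma = rho * np.ones((n, n)) + (1 - rho) * np.eye(n)   # rho * 11^T + (1 - rho) * I

# Eigenvalues: 1 + (n-1)*rho (once) and 1 - rho (n-1 times)
print(np.sort(np.linalg.eigvalsh(Sigma)))                # [0.7 0.7 0.7 0.7 2.2]

# Determinant: (1 - rho)^(n-1) * [1 + (n-1)*rho]
print(np.isclose(np.linalg.det(Sigma),
                 (1 - rho) ** (n - 1) * (1 + (n - 1) * rho)))

# Inverse: (1 / (1 - rho)) * (I - rho / (1 + (n-1)*rho) * 11^T)
Sigma_inv = (np.eye(n) - rho / (1 + (n - 1) * rho) * np.ones((n, n))) / (1 - rho)
print(np.allclose(Sigma_inv, np.linalg.inv(Sigma)))
```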

What is a diagonally dominant matrix?

A diagonally dominant matrix is a square matrix A \in \mathbb{R}^{n \times n} such that for each i = 1, \dots, n,

\begin{aligned} |A_{ii}| \geq \sum_{j \neq i} |A_{ij}|. \end{aligned}

It is said to be strictly diagonally dominant if the inequality above is strict for all values of i.

In words, a diagonally dominant matrix is a square matrix such that in each row, the absolute value of the term on the diagonal is greater than or equal to the sum of absolute values of the rest of the terms in that row.

Properties

  1. A strictly diagonally dominant matrix is non-singular, i.e. has an inverse. This is known as the Lévy-Desplanques theorem; a proof of the theorem can be found here.
  2. A symmetric diagonally dominant real matrix with non-negative diagonal entries is positive semidefinite (PSD). A proof can be found in the Wikipedia article (Reference 2). Similarly, a symmetric strictly diagonally dominant real matrix with positive diagonal entries is positive definite.

(Note: Diagonally dominant matrices can be defined for matrices with complex entries as well. The references and links in this post apply to matrices with complex entries; here I just focus on matrices with real entries.)
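
Both the definition and property 2 are easy to check numerically. Here is a minimal sketch in NumPy (the example matrix and the function name is_diagonally_dominant are arbitrary choices):

```python
import numpy as np

def is_diagonally_dominant(A, strict=False):
    """Check whether each row's diagonal entry (weakly or strictly) dominates the rest of its row."""
    diag = np.abs(np.diag(A))
    off_diag = np.abs(A).sum(axis=1) - diag
    return bool(np.all(diag > off_diag)) if strict else bool(np.all(diag >= off_diag))

# A symmetric, strictly diagonally dominant matrix with positive diagonal entries
A = np.array([[4.0, 1.0, 1.0],
              [1.0, 3.0, -1.0],
              [1.0, -1.0, 5.0]])
print(is_diagonally_dominant(A, strict=True))   # True
print(np.all(np.linalg.eigvalsh(A) > 0))        # True: positive definite, as property 2 predicts
```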

References:

  1. Wolfram MathWorld. Diagonally Dominant Matrix.
  2. Wikipedia. Diagonally dominant matrix.

The trace trick and the expectation of a quadratic form

The trace trick refers to the fact that a scalar quantity can be thought of as a 1 \times 1 matrix, and so is equal to its trace. This allows us to use properties of the trace in manipulating the object.

One example where the trace trick is useful is in proving the following lemma for the expectation of a quadratic form:

Lemma. Suppose Y \in \mathbb{R}^n is a random vector such that \mathbb{E}[Y] = \mu and \text{Var } Y = \Sigma. Then for any fixed matrix A \in \mathbb{R}^{n \times n},

\begin{aligned} \mathbb{E} \left[ Y^T AY \right] = \mu^T A \mu + \text{tr}(A\Sigma). \end{aligned}

Here is the proof: Note that

\begin{aligned} Y^T AY &= [\mu + (Y-\mu)]^T A [\mu + (Y-\mu)] \\  &= \mu^T A \mu + (Y-\mu)^T A\mu + \mu^T A (Y-\mu) + (Y-\mu)^T A (Y-\mu). \end{aligned}

Taking expectations on both sides, the two middle terms on the RHS are equal to 0 since \mathbb{E}[Y - \mu] = 0, giving

\begin{aligned} \mathbb{E}\left[Y^T AY\right] = \mu^T A \mu + \mathbb{E} \left[ (Y-\mu)^T A (Y-\mu) \right]. \end{aligned}

Note that the expression inside the expectation for the second term on the RHS is a 1 \times 1 matrix. Hence, applying the trace trick and using the fact that the matrix trace is invariant under cyclic permutations,

\begin{aligned} \mathbb{E}\left[Y^T AY\right] &= \mu^T A \mu + \mathbb{E} \left[ \text{tr} \left[ (Y-\mu)^T A (Y-\mu) \right] \right] \\  &= \mu^T A \mu + \mathbb{E} \left[ \text{tr} \left[  A (Y-\mu)(Y-\mu)^T \right] \right] \\  &= \mu^T A \mu + \text{tr} \left[ \mathbb{E} \left[  A (Y-\mu)(Y-\mu)^T \right] \right] &\text{(as trace is a linear operator)} \\  &= \mu^T A \mu + \text{tr} \left[ A \mathbb{E} \left[   (Y-\mu)(Y-\mu)^T \right] \right] \\  &= \mu^T A \mu + \text{tr} \left( A \Sigma \right). \end{aligned}
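
The lemma is also easy to check by simulation. Here is a minimal sketch in NumPy; the Gaussian distribution, dimension and seed are arbitrary choices (the lemma only requires Y to have mean \mu and covariance \Sigma):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
mu = rng.normal(size=n)
L = rng.normal(size=(n, n))
Sigma = L @ L.T                        # a valid covariance matrix
A = rng.normal(size=(n, n))            # any fixed n x n matrix

# Simulate Y with mean mu and covariance Sigma, then average the quadratic form
Y = rng.multivariate_normal(mu, Sigma, size=200_000)
mc_estimate = np.mean(np.einsum("ij,jk,ik->i", Y, A, Y))   # Monte Carlo estimate of E[Y^T A Y]

exact = mu @ A @ mu + np.trace(A @ Sigma)
print(mc_estimate, exact)              # the two values should be close
```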

Properties of covariance matrices

This post lists some properties of covariance matrices that I often forget.

A covariance matrix \Sigma \in \mathbb{R}^{n \times n} is simply a matrix for which there exists some random vector X \in \mathbb{R}^n with \Sigma_{ij} = \text{Cov}(X_i, X_j) for all i and j.

Properties:

  1. \Sigma is symmetric since \text{Cov}(X_i, X_j) = \text{Cov}(X_j, X_i).
  2. \Sigma is positive semi-definite (PSD):
    \begin{aligned} u^T \Sigma u &= \sum_{i, j=1}^n u_i \Sigma_{ij}u_j \\  &= \sum_{i, j=1}^n \text{Cov}(u_i X_i, u_j X_j) \\ &= \text{Cov}\left( \sum_i u_i X_i, \sum_j u_j X_j \right) \geq 0. \end{aligned}
  3. Because \Sigma is PSD, all of its eigenvalues are non-negative. (If \Sigma were positive definite, then its eigenvalues would be positive.)
  4. Since \Sigma is real and symmetric, all of its eigenvalues are real, and there exists a real orthogonal matrix Q such that D = Q^T \Sigma Q is a diagonal matrix. (The entries along the diagonal of D are \Sigma's eigenvalues.)
  5. Since \Sigma's eigenvalues are all non-negative, \Sigma has a (symmetric) square root \Sigma^{1/2} = QD^{1/2}Q^T, which satisfies \Sigma^{1/2}\Sigma^{1/2} = \Sigma. (If \Sigma is positive definite, then it has an inverse square root as well: \Sigma^{-1/2} = QD^{-1/2}Q^T.)

Note that the above apply to sample covariance matrices as well.
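
Here is a minimal NumPy sketch illustrating these properties on a sample covariance matrix (the simulated data are an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # some correlated data
Sigma = np.cov(X, rowvar=False)                           # 4 x 4 sample covariance matrix

print(np.allclose(Sigma, Sigma.T))             # symmetric (property 1)
eigvals, Q = np.linalg.eigh(Sigma)             # real eigenvalues, real orthogonal Q (property 4)
print(np.all(eigvals >= -1e-12))               # non-negative eigenvalues, up to rounding (property 3)

# Square root (property 5): Sigma^{1/2} = Q D^{1/2} Q^T
sqrt_Sigma = Q @ np.diag(np.sqrt(np.clip(eigvals, 0, None))) @ Q.T
print(np.allclose(sqrt_Sigma @ sqrt_Sigma, Sigma))
```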

References:

  1. Wikipedia. Covariance matrix.
  2. Statistical Odds & Ends. Properties of real symmetric matrices.

Matrix inversion formula in block form

Let’s say we have a (square) block matrix

\begin{aligned} \bf M = \begin{pmatrix} \bf A & \bf B \\ \bf C & \bf D \end{pmatrix}, \end{aligned}

where {\bf A} \in \mathbb{R}^{m \times m}, {\bf B} \in \mathbb{R}^{m \times n}, {\bf C} \in \mathbb{R}^{n \times m} and {\bf D} \in \mathbb{R}^{n \times n}. If {\bf A}, {\bf D} and the Schur complements {\bf A - BD^{-1}C} and {\bf D - CA^{-1}B} are all invertible, then the inverse of \bf M is given by

\begin{aligned} {\bf M}^{-1} &= \begin{pmatrix} \bf (A - BD^{-1}C)^{-1} & \bf -A^{-1}B(D - CA^{-1}B)^{-1} \\ \bf -D^{-1}C(A - BD^{-1}C)^{-1} & \bf (D - CA^{-1}B)^{-1} \end{pmatrix} \\  &=  \begin{pmatrix} \bf (A - BD^{-1}C)^{-1} & \bf -(A - BD^{-1}C)^{-1}BD^{-1} \\ \bf -(D - CA^{-1}B)^{-1}CA^{-1} & \bf (D - CA^{-1}B)^{-1} \end{pmatrix}. \end{aligned}

It can be shown that the two formulas are equal.

Special case 1: \bf C = 0, i.e. \bf M is a block upper triangular matrix.

In this case, the formula above simplifies to

\begin{aligned} \begin{pmatrix} \bf A & \bf B \\ \bf 0 & \bf D \end{pmatrix}^{-1} &= \begin{pmatrix} \bf A^{-1} & \bf -A^{-1}BD^{-1} \\ \bf 0 & \bf D^{-1} \end{pmatrix}. \end{aligned}

Special case 2: {\bf B = b} \in \mathbb{R}^m, {\bf C = b^T}, and {\bf D} = c \in \mathbb{R}.

(In this setting, \bf A is usually a symmetric matrix, making \bf M a symmetric matrix too.) Here, the formula above reduces to

\begin{aligned} \begin{pmatrix} \bf A & \bf b \\ \bf b^T & c \end{pmatrix}^{-1} &= \begin{pmatrix} \left( {\bf A} - \frac{1}{c}{\bf bb^T} \right)^{-1} & -\frac{1}{k}\bf A^{-1}b \\ -\frac{1}{k}\bf b^T A^{-1} & \frac{1}{k} \end{pmatrix} \\  &= \begin{pmatrix} {\bf A^{-1}} + \frac{1}{k} \bf A^{-1}bb^T A^{-1} & -\frac{1}{k}\bf A^{-1}b \\ -\frac{1}{k}\bf b^T A^{-1} & \frac{1}{k} \end{pmatrix}, \end{aligned}

where k = c - {\bf b^T A^{-1}b}. The second equality is due to the Sherman-Morrison formula.

When \bf A is symmetric, the last formula is particularly nice: besides {\bf A}^{-1} (which is typically already available), the RHS is a function of just k and \bf A^{-1} b.

One possible application of this formula is in Gaussian process regression (see Section 4 of this link for details). The covariance matrix at the (n+1)-st step has the form above, with \bf A being the covariance matrix at the n-th step. This formula makes it easy to invert the covariance matrix at each step.
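
Here is a small NumPy check of the special case 2 formula, assuming a symmetric positive definite \bf A (as in the Gaussian process setting); the dimensions and random seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 4
G = rng.normal(size=(m, m))
A = G @ G.T + m * np.eye(m)                                 # symmetric positive definite
b = rng.normal(size=m)
c = float(b @ np.linalg.solve(A, b)) + 1.0                  # chosen so that k = c - b^T A^{-1} b = 1 > 0

M = np.block([[A, b[:, None]], [b[None, :], np.array([[c]])]])

A_inv = np.linalg.inv(A)
k = c - b @ A_inv @ b
top_left = A_inv + np.outer(A_inv @ b, A_inv @ b) / k       # A^{-1} + (1/k) A^{-1} b b^T A^{-1}
off_diag = -(A_inv @ b) / k                                 # -(1/k) A^{-1} b
M_inv = np.block([[top_left, off_diag[:, None]],
                  [off_diag[None, :], np.array([[1.0 / k]])]])

print(np.allclose(M_inv, np.linalg.inv(M)))                 # True
```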

References:

  1. General Formula: Matrix Inversion Lemma.

Power method for obtaining the top eigenvector

The power method (also known as power iteration) is a simple algorithm for obtaining the “top” eigenvector and eigenvalue of a matrix.

Let A be an n \times n matrix with real-valued entries. Assume that:

  1. A is diagonalizable, or equivalently, A has n linearly independent eigenvectors v_1, \dots, v_n. (From here on, I assume that the v_i's have unit norm.)
  2. We can order the eigenvectors such that the eigenvalues associated with them (i.e. \lambda_i such that Av_i = \lambda_i v_i) satisfy the ordering |\lambda_1 | > |\lambda_2| \geq \dots \geq |\lambda_n|. What is important here is the strict inequality between |\lambda_1 | and |\lambda_2|.

The power method can be described as follows:

  1. Initialize b_0 \in \mathbb{R}^n to be some non-zero vector.
  2. For each k = 0, 1, \dots, set b_{k+1} = \dfrac{Ab_k}{\| Ab_k \|}, and set \mu_{k+1} = b_{k+1}^T A b_{k+1}.

If A satisfies the two assumptions above and b_0 has a non-zero component in the direction of v_1, then b_k converges to v_1, and \mu_k converges to \lambda_1. (This reference points out that if \lambda_1 is negative, \{ b_k \} might end up “flip-flopping” between v_1 and -v_1. One way to avoid this is to always insist on b_k having a non-negative first entry.)

The proof is pretty simple and can be found on the first page of the second reference. Section 6.2 of the third reference has some practical tips for how to use the power method in practice.

Can we say something about the convergence rate of the power method? In essence, the convergence rate depends on how close |\lambda_1| is to |\lambda_2|: the larger \left| \dfrac{\lambda_1}{\lambda_2} \right| is, the faster the power method converges. Theorem 5.6 of the last reference makes this precise:

Write our initialization b_0 as a linear combination of the eigenvectors: b_0 = \sum_{i=1}^n \alpha_i v_i. Assume that \alpha_1 \neq 0. Then there exists a constant C > 0 such that

\begin{aligned} \| b_k - v_1 \|_2 \leq C \left| \frac{\lambda_2}{\lambda_1} \right|^k \end{aligned} for all k \geq 1.

What if we are interested in eigenvalues/eigenvectors other than the top ones? Suppose in addition that A is symmetric, so that the eigenvectors v_1, \dots, v_n can be chosen to be orthonormal. We can use the power method to find \lambda_1 and v_1. It is then easy to show that the matrix

A' = A - \lambda_1 v_1 v_1^T

has eigenvalues 0, \lambda_2, \dots, \lambda_n and the same eigenvectors as A. We can then apply the power method to A' to get \lambda_2 and v_2, and so on. This is known as the power method with deflation. It should be noted that the deflation method becomes more inaccurate as we compute smaller and smaller eigenvalues/eigenvectors, because the approximation errors for the earlier eigenvalues/eigenvectors add up.
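
Here is a minimal NumPy sketch of the power method and deflation for a symmetric matrix; the example matrix, iteration count and function name are arbitrary choices:

```python
import numpy as np

def power_iteration(A, num_iters=1000, seed=0):
    """A bare-bones power method: returns approximations of the top eigenpair (lambda_1, v_1)."""
    rng = np.random.default_rng(seed)
    b = rng.normal(size=A.shape[0])   # random start: almost surely has a component along v_1
    b /= np.linalg.norm(b)
    for _ in range(num_iters):
        b = A @ b
        b /= np.linalg.norm(b)
    return b @ A @ b, b               # (Rayleigh quotient, unit-norm eigenvector estimate)

# A symmetric matrix with eigenvalues 4, 2 and 1
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# Deflation: repeatedly strip out the current top eigenpair and re-run the power method
eigenvalues, A_deflated = [], A.copy()
for _ in range(A.shape[0]):
    lam, v = power_iteration(A_deflated)
    eigenvalues.append(lam)
    A_deflated = A_deflated - lam * np.outer(v, v)
print([round(lam, 6) for lam in eigenvalues])   # approximately [4.0, 2.0, 1.0]
```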

References:

  1. Wikipedia. Power iteration.
  2. The Power Method.
  3. Numerical calculation of eigenvalues.
  4. Lambers, J. (2009-10). The Eigenvalue Problem: Power Iterations.
  5. https://math.unice.fr/~frapetti/CorsoF/cours4part2.pdf

Change of basis formula

When I ask you to picture the vector \begin{pmatrix} 2 \\ 2 \end{pmatrix}, most of you see this in your head:

To be precise, what you are picturing is the vector \begin{pmatrix} 2 \\ 2 \end{pmatrix} in the standard basis for \mathbb{R}^2: \mathcal{B} = \begin{pmatrix} e_1, e_2 \end{pmatrix}, where e_1 represents a unit vector in the direction of the positive x-axis, and e_2 represents a unit vector in the direction of the positive y-axis.

We write \begin{pmatrix} 2 \\ 2 \end{pmatrix}_{\mathcal{B}} = 2 e_1 + 2 e_2, and we think of \begin{pmatrix} 2 \\ 2 \end{pmatrix} as being the coordinates of this vector with respect to this basis.

Why this pedantry? We might, for example, want to express the vector above as a linear combination of different basis vectors, e.g. \mathcal{B}' = \begin{pmatrix} \begin{pmatrix} 1 \\ 1\end{pmatrix}_{\mathcal{B}}, \begin{pmatrix} 1 \\ -1\end{pmatrix}_{\mathcal{B}} \end{pmatrix}:

Using the notation above, we have

\begin{pmatrix} 2 \\ 2 \end{pmatrix}_{\mathcal{B}} = 2 e_1 + 2 e_2 = 2 \begin{pmatrix} 1 \\ 1\end{pmatrix} + 0 \begin{pmatrix} 1 \\ -1\end{pmatrix} = \begin{pmatrix} 2 \\ 0 \end{pmatrix}_{\mathcal{B}'}.

The vector has coordinates \begin{pmatrix} 2 \\ 0 \end{pmatrix} in this new basis.

Is there a way to get directly from the LHS to the RHS? The theorem below tells us how to:

Let \mathcal{B} = \begin{pmatrix} e_1, \dots, e_n \end{pmatrix} and \mathcal{B}' = \begin{pmatrix} e_1', \dots, e_n' \end{pmatrix} be two bases for a given space. Let M be the n \times n matrix such that for each j = 1, \dots, n, e_j = \sum_{i=1}^n M_{ij} e_i'. (The jth column of M is e_j expressed as a linear combination of the vectors in \mathcal{B}'.)

Then, for any x = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}, \begin{pmatrix} x \end{pmatrix}_{\mathcal{B}} = \begin{pmatrix} Mx \end{pmatrix}_{\mathcal{B}'}.

The proof consists solely of matrix algebra:

\begin{aligned} \begin{pmatrix} x \end{pmatrix}_{\mathcal{B}} &= \sum_{j=1}^n x_j e_j \\  &= \sum_{j=1}^n x_j \left(\sum_{i=1}^n M_{ij} e_i' \right) \\  &= \sum_{i=1}^n \left( \sum_{j=1}^n M_{ij} x_j \right) e_i' \\  &= \begin{pmatrix} Mx \end{pmatrix}_{\mathcal{B}'}. \end{aligned}

Let’s see this theorem in action for our example above. With respect to the standard coordinate system, we have

e_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, e_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, e_1' = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, e_2' = \begin{pmatrix} 1 \\ -1 \end{pmatrix},

and so

\begin{aligned} e_1 &= \frac{1}{2} e_1' + \frac{1}{2} e_2', \\  e_2 &= \frac{1}{2} e_1' - \frac{1}{2} e_2', \end{aligned}

which means M = \frac{1}{2}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}. Hence,

\begin{aligned} \begin{pmatrix} 2 \\ 2 \end{pmatrix}_{\mathcal{B}} &= \left( \frac{1}{2}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\begin{pmatrix} 2 \\ 2 \end{pmatrix}\right)_{\mathcal{B}'} = \left( \frac{1}{2}\begin{pmatrix} 4 \\ 0 \end{pmatrix}\right)_{\mathcal{B}'} \\  &= \begin{pmatrix} 2 \\ 0 \end{pmatrix}_{\mathcal{B}'}, \end{aligned}

as we saw earlier.
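
Here is the same example in NumPy. Since the jth column of M expresses e_j in the new basis, M is just the inverse of the matrix whose columns are e_1' and e_2' (written in the standard basis):

```python
import numpy as np

# Columns are the new basis vectors e_1' and e_2', written in the standard basis
E_prime = np.array([[1.0, 1.0],
                    [1.0, -1.0]])

# e_j = sum_i M_ij e_i'  is equivalent to  I = E_prime @ M,  so  M = E_prime^{-1}
M = np.linalg.inv(E_prime)
print(M)                   # [[ 0.5  0.5]
                           #  [ 0.5 -0.5]]

x = np.array([2.0, 2.0])   # coordinates in the standard basis B
print(M @ x)               # [2. 0.]  -- coordinates in the new basis B'
```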

Pseudoinverses and linear systems

In the last post, I defined matrix pseudoinverses and showed how to construct them. In this post, I will explain how pseudoinverses are useful for solving linear systems. By a linear system, I mean the set of equations represented by

\begin{aligned} Ax = b, \end{aligned}

where A \in \mathbb{R}^{m \times n} is a known matrix, b \in \mathbb{R}^m is a known vector, and x \in \mathbb{R}^n is the variable we wish to solve for. Recall that there may be 0, 1 or an infinite number of solutions to this system of equations. It turns out that the “best” solution (“best” to be defined) can be expressed with a pseudoinverse. The following proposition makes this precise:

Proposition: Let z = A^+ b. Then:

  1. For all x \in \mathbb{R}^n, we have \| Ax - b\|_2 \geq \| Az - b\|_2. In other words, z = A^+ b is the least-squares solution of the linear system.
  2. If Ax = b is solvable, then z = A^+ b is a solution. In fact, for any other solution x, we have \| z \|_2 \leq \|x \|_2, i.e. z is the minimum \ell_2-norm solution.

Proof of proposition: Recall the 4 properties of the pseudoinverse:

  1. AA^+ A = A
  2. A^+ A A^+ = A^+
  3. (AA^+)^T = AA^+
  4. (A^+ A)^T = A^+ A

Using these properties, we have

\begin{aligned} A^T (Az - b) &= A^T (AA^+ b - b) \\  &= A^T (AA^+)^T b - A^T b &\text{(by condition 3)} \\  &= (AA^+ A)^T b - A^T b = A^T b - A^Tb &\text{(by condition 1)} \\  &= 0. \end{aligned}

Hence, for all x \in \mathbb{R}^n,

\begin{aligned} \|Ax - b\|_2^2 &= \| A(x - z) + Az - b \|_2^2 \\  &= \|A(x-z)\|_2^2 + \|Az - b\|_2^2 + 2 [A(x-z)]^T (Az - b) \\  &=\|A(x-z)\|_2^2 + \|Az - b\|_2^2 + 2(x-z)^T [A^T (Az - b)] \\  &= \|A(x-z)\|_2^2 + \|Az - b\|_2^2 \\  &\geq \|Az - b\|_2^2. \end{aligned}

This establishes part 1 of the proposition.

The first statement of part 2 follows immediately: since the system is solvable, there is some x^* such that Ax^* = b, implying that 0 \leq \|Az -b\|_2 \leq \|Ax^* - b\|_2 = 0, i.e. z is also a solution. To show that z is the minimum-norm solution: Let x be any solution to the linear system. Note that

\begin{aligned} z^T (x - z) &= (A^+ A A^+ b)^T (x - z) &\text{(by condition 2)} \\  &= (A^+ b)^T A^+ A (x - A^+ b) &\text{(by condition 4)} \\  &= (A^+ b)^T (A^+ b - A^+ b) &\text{(using } Ax = b \text{ and condition 2)} \\  &= 0. \end{aligned}

Hence,

\begin{aligned} \|x\|_2^2 &= \|z + (x-z) \|_2^2 \\  &= \|z\|_2^2 + \|x - z\|_2^2 + 2 z^T (x-z) \\  &= \|z\|_2^2 + \|x - z\|_2^2 + 0 \\  &\geq \|z\|_2^2. \end{aligned}

This completes the proof of the proposition. \blacksquare
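
Both parts of the proposition are easy to see numerically using NumPy's built-in pseudoinverse np.linalg.pinv (the matrices and seed below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
# A rank-deficient system: 5 equations, 4 unknowns, rank 2
A = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 4))
b = rng.normal(size=5)
z = np.linalg.pinv(A) @ b                    # z = A^+ b

# Part 1: z is a least-squares solution
x = rng.normal(size=4)                       # an arbitrary competitor
print(np.linalg.norm(A @ x - b) >= np.linalg.norm(A @ z - b))        # True

# Part 2: when the system is solvable, z is the minimum-norm solution
b2 = A @ rng.normal(size=4)                  # a right-hand side in the column space of A
z2 = np.linalg.pinv(A) @ b2
x2 = z2 + (np.eye(4) - np.linalg.pinv(A) @ A) @ rng.normal(size=4)   # another solution (adds a null-space vector)
print(np.allclose(A @ x2, b2))                                       # True
print(np.linalg.norm(z2) <= np.linalg.norm(x2))                      # True
```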

How is this proposition relevant to statistics? Consider the linear regression problem, where we have response y \in \mathbb{R}^n which we wish to predict using linear regression with a data matrix X \in \mathbb{R}^{n \times p}. With the least squares criterion, this amounts to

\underset{\beta \in \mathbb{R}^p}{\text{minimize}} \quad \| y - X\beta\|_2^2.

We can solve this by taking the derivative w.r.t. \beta and setting it to zero, i.e.

\begin{aligned} 2 (-X)^T (y - X\beta) &= 0, \\  X^T X \beta &= X^T y. \end{aligned}

The normal equations above are in fact always solvable (X^T y lies in the column space of X^T, which equals the column space of X^T X), and the proposition tells us that \beta = (X^T X)^+ X^T y is the least squares solution with minimum \ell_2 norm.
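
As a quick numerical check (with an arbitrary rank-deficient design matrix), (X^T X)^+ X^T y matches the minimum \ell_2-norm least squares solution returned by NumPy's lstsq:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 6
X = rng.normal(size=(n, 4)) @ rng.normal(size=(4, p))   # rank-deficient design: rank 4 < p
y = rng.normal(size=n)

beta = np.linalg.pinv(X.T @ X) @ X.T @ y                 # (X^T X)^+ X^T y
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]        # minimum-norm least squares solution
print(np.allclose(beta, beta_lstsq))                     # True
```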

References:

  1. Moore-Penrose inverse. Wikipedia.
  2. Proofs involving the Moore-Penrose inverse. Wikipedia.

Matrix pseudoinverses

There are some things that are never formally taught (in either undergrad or grad school) but are assumed to be known by grad students. In statistics, one of these things is pseudoinverses. In this post, we will define the matrix pseudoinverse and prove that it exists and is unique.

If A \in \mathbb{R}^{n \times n} is a square matrix such that \det A \neq 0, then it has an inverse, denoted A^{-1}, such that

\begin{aligned} A^{-1}A = A A^{-1} = I, \end{aligned}

where I is the identity matrix. There are two issues with this definition: (i) it only applies to square matrices, and (ii) it only applies when the determinant is non-zero. The pseudoinverse, also known as the Moore-Penrose inverse, is a generalization of the inverse which fixes these two issues.

Definition

Let A \in \mathbb{R}^{m \times n} be a matrix with real entries. Its pseudoinverse is defined as the unique matrix, denoted A^+ \in \mathbb{R}^{n \times m}, which satisfies the following 4 properties:

  1. AA^+ A = A
  2. A^+ A A^+ = A^+
  3. (AA^+)^T = AA^+
  4. (A^+ A)^T = A^+ A

If A has the usual matrix inverse A^{-1}, it is easy to see that A^{-1} satisfies the 4 conditions above. It’s also not too hard to see that the pseudoinverse of a pseudoinverse is the original matrix, i.e. (A^+)^+ = A, and that the pseudoinverse of the transpose is the transpose of the pseudoinverse, i.e. (A^T)^+ = (A^+)^T.

Existence

A definition for the pseudoinverse does not guarantee that it exists. It turns out that for every matrix, we can construct its pseudoinverse pretty easily. Let’s first construct this for rectangular diagonal matrices D \in \mathbb{R}^{m \times n}, i.e. D_{ij} = 0 for all i \neq j.

Let D^+ \in \mathbb{R}^{n \times m} be the matrix such that

\begin{aligned} D^+_{ij} = \begin{cases} \frac{1}{D_{ij}} &\text{if } i = j \text{ and } D_{ij} \neq 0 \\ 0 &\text{otherwise.} \end{cases} \end{aligned}

It can be verified by direct matrix multiplication that the 4 pseudoinverse conditions are satisfied.

For the general matrix A \in \mathbb{R}^{m \times n}, recall that it has a singular value decomposition A = UDV^T, where U \in \mathbb{R}^{m \times m} and V \in \mathbb{R}^{n \times n} are orthogonal matrices and D is a rectangular diagonal matrix. If we define A^+ = VD^+ U^T, we can verify (again by direct matrix multiplication) that the 4 pseudoinverse conditions are satisfied.
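
Here is a minimal NumPy sketch of this construction: build A^+ = VD^+U^T from the SVD and check the 4 conditions (the example matrix, tolerance and function name are arbitrary choices):

```python
import numpy as np

def pinv_via_svd(A, tol=1e-12):
    """Construct A^+ = V D^+ U^T from the SVD A = U D V^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=True)
    s_plus = np.array([1.0 / sv if sv > tol else 0.0 for sv in s])   # invert the non-zero singular values
    D_plus = np.zeros((A.shape[1], A.shape[0]))                      # rectangular diagonal D^+
    D_plus[:len(s), :len(s)] = np.diag(s_plus)
    return Vt.T @ D_plus @ U.T

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])        # a rank-1, 2 x 3 matrix (no ordinary inverse exists)
A_plus = pinv_via_svd(A)

# Check the 4 defining conditions, and compare with NumPy's implementation
print(np.allclose(A @ A_plus @ A, A),
      np.allclose(A_plus @ A @ A_plus, A_plus),
      np.allclose((A @ A_plus).T, A @ A_plus),
      np.allclose((A_plus @ A).T, A_plus @ A),
      np.allclose(A_plus, np.linalg.pinv(A)))
```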

Uniqueness

Next, we show that the pseudoinverse is unique. Let B and C be pseudoinverses for A. We have

\begin{aligned} AB &= ACAB &\text{(by condition 1)} \\  &= (AC)^T (AB)^T &\text{(by condition 3)} \\  &= C^T A^T B^T A^T = C^T (ABA)^T \\  &= C^T A^T &\text{(by condition 1)} \\  &= (AC)^T \\  &= AC. &\text{(by condition 3)} \end{aligned}

We can show analogously that BA = CA (using conditions 1 and 4). Hence,

\begin{aligned} B &= BAB &\text{(by condition 2)} \\  &= BAC = CAC \\  &= C. &\text{(by condition 2)} \end{aligned}

Hence, the pseudoinverse must be unique.

(Note: In this post I have restricted myself to matrices with real entries. Pseudoinverses work for matrices with complex entries as well, but the exposition is slightly more complicated and not relevant to the work I do in statistics.)

References:

  1. Moore-Penrose inverse. Wikipedia.
  2. Proofs involving the Moore-Penrose inverse. Wikipedia.

Matrix norms

This post describes the common matrix norms that I’ve encountered in my work in statistics. Unless otherwise stated, let X be an m \times n matrix with real-valued entries.

Operator norm

Let \| \cdot \| be some vector norm, defined on \mathbb{R}^d for every dimension d (for example, the \ell_p norm). This norm induces an operator norm for matrices, defined by

\begin{aligned} \| X \|_{op} = \sup_{\| v \| = 1} \| X v \| = \sup_{v \neq 0} \frac{\| Xv\|}{\|v\|},  \end{aligned}

where v \in \mathbb{R}^n (so that Xv \in \mathbb{R}^m). Note that the operator norm depends on the underlying vector norm.

When the underlying norm is the \ell_p norm, i.e. \| v \|_p = \left( \sum_i |v_i|^p \right)^{1/p}, we frequently write the operator norm as

\begin{aligned} \| X \|_p = \sup_{\| v \|_p = 1} \| X v \|_p = \sup_{v \neq 0} \frac{\| Xv\|_p}{\|v\|_p}.  \end{aligned}

When p = 2, the operator norm \| X \|_2 is simply the largest singular value of X.

Frobenius norm

The Frobenius norm of a matrix has a few equivalent definitions:

\begin{aligned} \| X \|_F &= \sqrt{\displaystyle\sum_{i=1}^m \sum_{j = 1}^n x_{ij}^2} \\ &= \sqrt{\text{tr}(X^T X)} = \sqrt{\text{tr}(XX^T)} \\ &= \sqrt{\sum_{i=1}^{\min(m,n)} \sigma_i^2(X) }, \end{aligned}

where the \sigma_i(X)'s are the singular values of X. We can think of the Frobenius norm as an extension of the \ell_2 norm for vectors to matrices.

If X has rank r, we have the inequality

\| X \|_2 \leq \| X \|_F \leq \sqrt{r} \| X \|_2.

Nuclear norm

The nuclear norm of a matrix is given by

\begin{aligned} \| X \|_* = \text{tr}(\sqrt{X^T X}) = \sum_{i=1}^{\min (m,n)} \sigma_i(X). \end{aligned}

It is also known as the Schatten 1-norm or the trace norm.

Notice that the nuclear norm is equal to the \ell_1 norm of the singular values, while the rank of the matrix is equal to the \ell_0 norm of the singular values. This is why the nuclear norm is often used as a regularizer for objective functions over matrices. Often we want to find X which minimizes some convex objective f(X) subject to \text{rank}(X) \leq r, or, in penalized form,

\text{minimize} \quad f(X) + \lambda \text{rank}(X).

This problem is non-convex in X, which makes optimization difficult. Instead of this, we could work with the optimization problem

\text{minimize} \quad f(X) + \lambda \|X\|_*.

This is a convex problem, which makes optimization much easier.

If X has rank r, we have the inequality

\| X \|_F \leq \| X \|_* \leq \sqrt{r} \| X \|_F.
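
All three norms are built into NumPy, so the definitions and the two inequalities above are easy to check numerically (the matrix below is an arbitrary rank-3 example):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(6, 3)) @ rng.normal(size=(3, 4))   # a 6 x 4 matrix of rank r = 3
sigma = np.linalg.svd(X, compute_uv=False)              # singular values
r = np.linalg.matrix_rank(X)

op_norm = np.linalg.norm(X, 2)        # operator (spectral) norm = largest singular value
fro_norm = np.linalg.norm(X, "fro")   # Frobenius norm
nuc_norm = np.linalg.norm(X, "nuc")   # nuclear norm = sum of the singular values

print(np.isclose(op_norm, sigma[0]),
      np.isclose(fro_norm, np.sqrt(np.sum(sigma ** 2))),
      np.isclose(nuc_norm, np.sum(sigma)))

# The two inequalities stated above
print(op_norm <= fro_norm <= np.sqrt(r) * op_norm)      # True
print(fro_norm <= nuc_norm <= np.sqrt(r) * fro_norm)    # True
```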

References:

  1. Matrix norm. Wikipedia.