What is the Cauchy combination test?

Set-up

Assume that we have p-values p_1, \dots, p_d. Assume that they are computed from z-scores \mathbf{X} = (X_1, \dots, X_d) (test statistics following normal distributions). Let \mathbb{E}[\mathbf{X}] = \boldsymbol{\mu} and let \text{Cov}(\mathbf{X}) = \mathbf{\Sigma}. Without loss of generality, assume that each test statistic X_i has variance 1. With this, we can express the p-values as

\begin{aligned} p_i = 2 \left[ 1 - \Phi (|X_i|) \right], \end{aligned}

where \Phi is the CDF of the standard normal distribution.

We are interested in testing the global null hypothesis H_0 :\boldsymbol{\mu} = \mathbf{0}.

The Cauchy combination test

Assume that we have some random vector w = (w_1, \dots, w_d) such that w_i \geq 0 for all i, w_1 + \dots + w_d = 1, and w is independent of \mathbf{X}. The test statistic for the Cauchy combination test, proposed by Liu & Xie 2020 (Reference 1), is

\begin{aligned} T(\mathbf{X}) = \sum_{i=1}^d w_i \tan \left[ \pi \left( 2 \Phi (|X_i|) - \frac{3}{2} \right) \right] = \sum_{i=1}^d w_i \tan \left[ \pi \left( \frac{1}{2} - p_i \right) \right]. \end{aligned}

Under the global null, p_i \sim \text{Unif}(0,1) for each i, implying that \tan \left[ \pi \left( \frac{1}{2} - p_i \right) \right] has the standard Cauchy distribution (see this previous post). If the p-values are independent, then Proposition 1 in this other previous post implies that T(\mathbf{X}) has the standard Cauchy distribution.

It turns out that even if the p-values are dependent, T(\mathbf{X}) is still approximately Cauchy! This approximation is formalized in Theorem 1 of Reference 1:

Theorem. Suppose that for any 1 \leq i < j \leq d, (X_i, X_j) follows a bivariate normal distribution. Suppose also that \mathbb{E}[\mathbf{X}] = \mathbf{0}. Let W_0 be a standard Cauchy random variable. Then for any fixed d and any correlation matrix \mathbf{\Sigma} \geq \mathbf{0}, we have

\begin{aligned} \lim_{t \rightarrow +\infty} \dfrac{\mathbb{P}\{ T(\mathbf{X}) > t \}}{\mathbb{P} \{ W_0 > t \} } = 1. \end{aligned}

The theorem says that the test statistic T(\mathbf{X}) has approximately a Cauchy tail even under dependency structures in \mathbf{X}. Knowing the (approximate) distribution of T(\mathbf{X}) under the global null allows us to use it as a test statistic.
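
To make this concrete, here is a minimal R sketch of how one might combine a set of p-values with this test. Equal weights are an assumption (any non-negative weights summing to 1 would do), and the combined p-value uses the standard Cauchy tail probability from the theorem above:

# Combine p-values with the Cauchy combination test (equal weights assumed)
cauchy_combination_test <- function(pvalues, weights = rep(1 / length(pvalues), length(pvalues))) {
  t_stat <- sum(weights * tan(pi * (0.5 - pvalues)))
  pcauchy(t_stat, lower.tail = FALSE)  # approximate p-value via the Cauchy tail
}

# Example: one very small p-value among three
cauchy_combination_test(c(0.001, 0.3, 0.7))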

Other notes

  • The “bivariate normal distribution” condition is a bit of a technical assumption; the authors claim it is a mild one.
  • This result bears a lot of similarity with a result by Pillai & Meng 2016 (see this previous post for a description). Section 2.2 of Reference 1 discusses the similarities and the differences.
  • Theorem 1 above holds for fixed d (number of p-values). Section 2.3 of Reference 1 has a high-dimensional asymptotic result where d = o(t^c) with 0 < c < 1/2.
  • The Cauchy combination test is especially powerful when only a small number of \mu_i‘s are large, or equivalently when only a small number of p_i‘s are very small. We can see this intuitively: small p_i‘s become very large \tan [\pi(1/2 - p_i)]‘s, so the test statistic will be dominated by the terms corresponding to a few very small p-values. See Section 4.2 of Reference 1 for a power comparison study.

References:

  1. Liu, Y., and Xie, J. (2020). Cauchy Combination Test: A Powerful Test With Analytic p-Value Calculation Under Arbitrary Dependency Structures.

An interesting result involving the Cauchy distribution

If Z_1, \dots, Z_n are independent \text{Cauchy}(0, 1) variables and w= (w_1, \dots, w_n) is a random vector independent of the Z_i‘s with w_i \geq 0 for all i and w_1 + \dots + w_n = 1, it is well-known that \displaystyle\sum_{i=1}^n w_i Z_i also has a \text{Cauchy}(0, 1) distribution.

It is also well-known that if X and Y are independent standard normal variables, then X / Y has a \text{Cauchy}(0, 1) distribution.

Putting these two facts together, we have the following proposition:

Proposition 1. Let \mathbf{X} = (X_1, \dots, X_n) and \mathbf{Y} = (Y_1, \dots, Y_n) be i.i.d. \mathcal{N}(\mathbf{0}, \mathbf{\Sigma}), where \mathbf{\Sigma} is a diagonal matrix with strictly positive entries on the diagonal. Let w = (w_1, \dots, w_n) be a random vector independent of (\mathbf{X},\mathbf{Y}) such that w_i \geq 0 for all i and w_1 + \dots + w_n = 1. Then

\begin{aligned} \sum_{i=1}^n w_i \dfrac{X_i}{Y_i} \sim \text{Cauchy}(0,1). \end{aligned}

Interestingly, the result above holds for any covariance matrix \mathbf{\Sigma} with positive entries on the diagonal! This was proved in Pillai & Meng 2016 (Reference 1). In other words, we can have any dependence structure among the elements of \mathbf{X} (and likewise for \mathbf{Y}); as long as the dependence structure is the same for the two random vectors, the linear combination of their ratios remains Cauchy-distributed! Here is the formal statement:

Proposition 2. Let \mathbf{X} = (X_1, \dots, X_n) and \mathbf{Y} = (Y_1, \dots, Y_n) be i.i.d. \mathcal{N}(\mathbf{0}, \mathbf{\Sigma}), where \mathbf{\Sigma} is any covariance matrix with strictly positive entries on the diagonal. Let w = (w_1, \dots, w_n) be a random vector independent of (\mathbf{X},\mathbf{Y}) such that w_i \geq 0 for all i and w_1 + \dots + w_n = 1. Then

\begin{aligned} \sum_{i=1}^n w_i \dfrac{X_i}{Y_i} \sim \text{Cauchy}(0,1). \end{aligned}
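
Here is a small R simulation sketch of Proposition 2; the covariance matrix, weights and sample size are arbitrary choices, and the weights are drawn only once (a special case of being independent of (\mathbf{X},\mathbf{Y})):

library(MASS)  # for mvrnorm

set.seed(1)
n <- 5
Sigma <- matrix(0.6, n, n) + diag(0.4, n)  # equicorrelated, unit variances
w <- runif(n); w <- w / sum(w)             # non-negative weights summing to 1

samples <- replicate(1e4, {
  X <- mvrnorm(1, rep(0, n), Sigma)
  Y <- mvrnorm(1, rep(0, n), Sigma)
  sum(w * X / Y)
})

# Compare empirical quantiles with those of the standard Cauchy
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
rbind(empirical = quantile(samples, probs), cauchy = qcauchy(probs))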

References:

  1. Pillai, N. S., and Meng, X.-L. (2016). An Unexpected Encounter with Cauchy and Lévy.

Converting a uniform distribution into a Cauchy distribution

The following proposition demonstrates how one can transform a uniform distribution on (0, 1) into a standard Cauchy distribution:

Proposition. If U \sim \text{Unif}(0,1) and X = \tan \left[ \pi \left( \frac{1}{2} - U \right)\right], then X has the standard Cauchy distribution.

Proof: For any x \in \mathbb{R},

\begin{aligned} \mathbb{P} \left\{ X \leq x \right\} &= \mathbb{P} \left\{ \tan \left[ \pi \left( \frac{1}{2} - U \right)\right] \leq x \right\} \\  &= \mathbb{P} \left\{ \frac{1}{2} - U \leq \frac{1}{\pi}\tan^{-1} x \right\} \\  &= \mathbb{P} \left\{ \frac{1}{2} - \frac{1}{\pi}\tan^{-1} x \leq U  \right\} \\  &= 1 - \left[ \frac{1}{2} - \frac{1}{\pi}\tan^{-1} x \right] \\  &= \frac{1}{2} + \frac{1}{\pi}\tan^{-1} x, \end{aligned}

where the second-last equality is due to the fact that \frac{1}{\pi}\tan^{-1} x \in (-\frac{1}{2}, \frac{1}{2}), so the quantity on the LHS inside the probability is always in (0, 1). In other words, X has the CDF of the standard Cauchy distribution.

We could also differentiate the CDF above w.r.t. x to get the PDF of X, which is more easily recognizable as the PDF of the standard Cauchy:

\begin{aligned} f_X(x) &= \dfrac{d}{dx} \left[ \frac{1}{2} + \frac{1}{\pi}\tan^{-1} x \right] \\  &= \frac{1}{\pi (1 + x^2)}. \end{aligned}

(Alternatively, one could use the formulas in this previous post with minor modifications since the transformation here is monotonically decreasing, not increasing.)
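
As a quick sanity check of the proposition, here is a small R simulation (the sample size and seed are arbitrary):

set.seed(1)
u <- runif(1e5)
x <- tan(pi * (0.5 - u))
ks.test(x, "pcauchy")  # should not reject: x behaves like a standard Cauchy sample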

What are Adam and AdamW?

Adam (short for adaptive moment estimation) was proposed by Kingma & Ba (2014) (Reference 1) as a first-order gradient-based optimization method that combines the advantages of AdaGrad (Duchi et al. 2011, Reference 2) (ability to deal with sparse gradients) and RMSProp (Tieleman & Hinton 2012, Reference 3) (ability to deal with non-stationary objectives). While considered an old method by now, it’s still widely used as a baseline algorithm to compare more cutting-edge algorithms against.

According to the authors,

“Some of Adam’s advantages are that the magnitudes of parameter updates are invariant to rescaling of the gradient, its stepsizes are approximately bounded by the stepsize hyperparameter, it does not require a stationary objective, it works with sparse gradients, and it naturally performs a form of step size annealing.”

Notation

  • f : \mathbb{R}^d \mapsto \mathbb{R} is a noisy objective function that is differentiable w.r.t. its parameter \theta. We want to minimize \mathbb{E}[f(\theta)].
  • f_1(\theta), \dots, f_T(\theta) are stochastic realizations of f at timesteps 1, \dots, T.
  • Let g_t = \nabla_\theta f_t (\theta) denote the gradient of f_t w.r.t. \theta.

Adam

At each timestep t, Adam keeps track of \hat{m}_t, an estimate of the first moment of the gradient g_t, and \hat{v}_t, an estimate of the second raw moment (uncentered variance) of the gradient, based on g_t^2 = g_t \odot g_t (\odot represents element-wise multiplication). The Adam update is

\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon),

where \alpha and \epsilon are hyperparameters. The full description of Adam is in the Algorithm below:

Adam algorithm (Algorithm 1 of Kingma & Ba).
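
For concreteness, here is a minimal R sketch of the update loop, following the structure of Algorithm 1 (the objective in the example and the hyperparameter values are just illustrative choices):

# One possible R sketch of Adam (bias-corrected first and second moment estimates)
adam <- function(grad_fn, theta, alpha = 0.001, beta1 = 0.9, beta2 = 0.999,
                 eps = 1e-8, n_steps = 1000) {
  m <- rep(0, length(theta))  # first moment estimate
  v <- rep(0, length(theta))  # second raw moment estimate
  for (t in 1:n_steps) {
    g <- grad_fn(theta)
    m <- beta1 * m + (1 - beta1) * g
    v <- beta2 * v + (1 - beta2) * g^2
    m_hat <- m / (1 - beta1^t)  # bias correction
    v_hat <- v / (1 - beta2^t)
    theta <- theta - alpha * m_hat / (sqrt(v_hat) + eps)
  }
  theta
}

# Illustrative use: minimize f(theta) = sum((theta - c(1, -2))^2); should end up near c(1, -2)
adam(function(theta) 2 * (theta - c(1, -2)), theta = c(0, 0), alpha = 0.1)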

Comparison with other methods

In comparison with Adam, the vanilla SGD update is

\theta_t \leftarrow \theta_{t-1} - \alpha \cdot g_t ,

and the classical momentum-based SGD update is

\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t.

The basic version of AdaGrad has update rule

\begin{aligned} \theta_t \leftarrow \theta_{t-1} - \alpha \cdot g_t / \sqrt{\sum_{i=1}^t g_i^2}. \end{aligned}

AdaGrad corresponds to Adam with \beta_1 = 0, infinitesimal (1-\beta_2), and replacing \alpha with \alpha_t = \alpha \cdot t^{-1/2} (see Section 5 of Reference 1 for a few more details).

AdamW

AdamW, proposed by Loshchilov & Hutter 2019 (Reference 4), is a simple modification of Adam which “improves regularization in Adam by decoupling the weight decay from the gradient-based update”.

The original weight decay update was of the form

\begin{aligned} \theta_{t+1} = (1-\lambda) \theta_t - \alpha g_t, \end{aligned}

where \lambda defines the rate of weight decay per step. Loshchilov & Hutter (2019) note that for standard SGD, this weight decay is equivalent to L2 regularization (Proposition 1 of Reference 4), but the two are not equivalent for the Adam update. They present Adam with L2 regularization and Adam with decoupled weight decay (AdamW) together in Algorithm 2 of the paper.
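
Roughly, the difference between the two can be written as follows (a simplified sketch that ignores the schedule multiplier \eta_t used in the paper): Adam with L2 regularization adds the decay term to the gradient before the moment estimates are computed, so it gets rescaled by the adaptive step size, whereas AdamW applies the decay directly to the parameters:

\begin{aligned} \text{Adam with } L_2 \text{ regularization}: \quad g_t \leftarrow \nabla f_t(\theta_{t-1}) + \lambda \theta_{t-1}, \qquad \theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon), \\ \text{AdamW}: \quad g_t \leftarrow \nabla f_t(\theta_{t-1}), \qquad \theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) - \lambda \theta_{t-1}. \end{aligned}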

Overall, the authors found that AdamW performed better than Adam with L2 regularization.

PyTorch’s implementation of AdamW can be found here.

References:

  1. Kingma, D. P., and Ba, J. (2014). Adam: A Method for Stochastic Optimization.
  2. Duchi, J., et al. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.
  3. Tieleman, T., and Hinton, G. (2012). Lecture 6.5 – RMSProp, Coursera: Neural Networks for Machine Learning. (If someone has a link/PDF for the slides, I’d be grateful!)
  4. Loshchilov, I., and Hutter, F. (2019). Decoupled Weight Decay Regularization.

What is elastic weight consolidation?

Elastic weight consolidation (EWC) is a technique proposed by Kirkpatrick et al. (2016) (Reference 1) as a way to avoid catastrophic forgetting in neural networks.

Set-up and notation

Imagine that we want to train a neural network to be able to perform well on several different tasks. In real-world settings, it’s not always possible to have all the data from all tasks available at the beginning of model training, so we want our model to be able to learn continually: “that is, [have] the ability to learn consecutive tasks without forgetting how to perform previously trained tasks.”

This turns out to be difficult for neural networks because of a phenomenon known as catastrophic forgetting. Assume that I have already trained a model to be good at one task, task A. Now, when I further train this model to be good at a second task, task B, performance on task A tends to degrade, with the performance drop often happening abruptly (hence “catastrophic”). This happens because “the weights in the network that are important for task A are changed to meet the objectives of task B”.

Let’s rewrite the previous paragraph with some mathematical notation. Assume that we have training data \mathcal{D}_A and \mathcal{D}_B for tasks A and B respectively. Let \mathcal{D} = \mathcal{D}_A \cup \mathcal{D}_B. Our neural network is parameterized by \theta: model training corresponds to finding an “optimal” value of \theta.

Let’s say that training on task A gives us the network parameters \theta_A^\star. Catastrophic forgetting means that when we further train on task B (reaching new optimal parameters \theta_B^\star), \theta_B^\star ends up too far away from \theta_A^\star and hence performs poorly on task A.

Elastic weight consolidation (EWC)

The intuition behind elastic weight consolidation (EWC) is the following: since large neural networks tend to be over-parameterized, it is likely that for any one task, there are many values of \theta that will give similar performance. Hence, when further training on task B, we should constrain the model parameters to be “close” to \theta_A^\star in some sense. The figure below (Figure 1 from Reference 1) illustrates this intuition.

How do we make this intuition concrete? Assume that we are in a Bayesian setting. What we would like to do is maximize the (log) posterior probability of \theta given all the data \mathcal{D}:

\begin{aligned} \log p(\theta | \mathcal{D}) = \log p(\mathcal{D} | \theta) + \log p(\theta) - \log p(\mathcal{D}). \end{aligned}

Note that the equation above applies if we replace all instances of \mathcal{D} with \mathcal{D}_A or \mathcal{D}_B. Assuming the data for tasks A and B are independent, let’s do some rewriting of the equation above:

\begin{aligned} \log p(\theta | \mathcal{D}) &= \log p(\mathcal{D}_A | \theta) + \log p(\mathcal{D}_B | \theta) + \log p(\theta) - \log p(\mathcal{D}_A) - \log p(\mathcal{D}_B) \\  &= \log p(\mathcal{D}_B | \theta) + \log p(\theta) - \log p(\mathcal{D}_A) - \log p(\mathcal{D}_B) \\  &\qquad + [\log p(\theta | \mathcal{D}_A) - \log p(\theta) + \log p (\mathcal{D}_A) ] \\  &= \log p(\mathcal{D}_B | \theta) + \log p(\theta | \mathcal{D}_A) - \log p(\mathcal{D}_B). \end{aligned}

Thus, maximizing the log posterior probability of \theta is equivalent to maximizing

\log p(\mathcal{D}_B | \theta) + \log p(\theta | \mathcal{D}_A). \qquad -(1)

Assume that we have already trained our model in some way to get to \theta_A^\star, which does well on task A. We recognize the first term above as the log-likelihood function for task B’s training data: maximizing just the first term alone corresponds to the usual maximum likelihood estimation (MLE). For the second term, we can approximate the posterior distribution of \theta based on task A’s data as a Gaussian distribution centered around \theta_A^\star with some covariance (this is known as the Laplace approximation). Simplifying even further: instead of letting the distribution have precision matrix equal to the full Fisher information matrix F, we assume that the precision matrix is diagonal, with the same values as the diagonal of F.
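
Concretely, under the Laplace approximation with this diagonal precision, the second term of (1) becomes (up to an additive constant not depending on \theta, and up to a scaling that is absorbed into the hyperparameter \lambda below)

\begin{aligned} \log p(\theta | \mathcal{D}_A) \approx \text{const} - \frac{1}{2} \sum_i F_i (\theta_i - \theta_{A,i}^\star)^2. \end{aligned}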

Putting it all together, maximizing (1) is equivalent to minimizing

\mathcal{L}(\theta) = \mathcal{L}_B (\theta) + \sum_i \frac{\lambda}{2} F_i (\theta_i - \theta_{A,i}^\star)^2,

where \mathcal{L}_B(\theta) is the loss for task B only, i indexes the weights and \lambda \geq 0 is an overall hyperparameter that balances the importance of the two tasks. EWC minimizes \mathcal{L}(\theta).

Why the name “elastic weight consolidation”? From the authors:

“This constraint is implemented as a quadratic penalty, and can therefore be imagined as a spring anchoring the parameters to the previous solution, hence the name elastic. Importantly, the stiffness of this spring should not be the same for all parameters; rather, it should be greater for those parameters that matter most to the performance during task A.”

References:

  1. Kirkpatrick, J., et al. (2016). Overcoming catastrophic forgetting in neural networks.

A simple probabilistic algorithm for estimating the number of distinct elements in a data stream

I just came across a really interesting and simple algorithm for estimating the number of distinct elements in a stream of data. The paper (Chakraborty et al. 2023) is available on arXiv; see this Quanta article (Reference 2) for a layman’s explanation.

Problem statement

Let’s state the problem formally. Let’s say we are given a stream \mathcal{A} = \langle a_1, \dots, a_m  \rangle where each a_i \in [n], and let \mathsf{F}_0(\mathcal{A}) denote the number of distinct elements of \mathcal{A}. For any input parameters (\epsilon, \delta), we want an (\epsilon, \delta)-approximation of \mathsf{F}_0(\mathcal{A}), i.e. a number c such that

\begin{aligned} \mathbb{P} \left\{ (1-\epsilon) \cdot \mathsf{F}_0(\mathcal{A}) \leq c \leq (1+\epsilon) \cdot \mathsf{F}_0(\mathcal{A}) \right\} \geq 1 - \delta. \end{aligned}

A naive solution would be to maintain a buffer \mathcal{X} which keeps track of all distinct elements seen so far. As we inspect each new item, we check if the item is in \mathcal{X}: if it’s there we discard the new item, and if it’s not we add it to \mathcal{X}. After going through the m items, the size of \mathcal{X} is exactly the number of distinct items. However, this approach has bad worst-case memory usage, and checking whether an element is in \mathcal{X} can be costly when \mathcal{X} is large (e.g. think of counting the number of distinct visitors to google.com today).

Algorithm

There is a long line of work seeking to do better than the naive solution (see Bibliographic Remarks in Reference 1); Chakraborty et al. claim that theirs is the first state-of-the-art algorithm whose proof requires just basic probability theory. The algorithm is easy enough to be reproduced here; see Reference 2 for an excellent walkthrough of an example run of the algorithm.

Algorithm 1 from Chakraborty et al. (2023), annotations are mine.

At a high level, the algorithm maintains a buffer \mathcal{X} of seen items and proceeds in rounds. In each round, we check whether each incoming item is in \mathcal{X} or not. If it is not in \mathcal{X}, we add it with probability p; if it is already in \mathcal{X}, we keep it with probability p. The probability p is made smaller and smaller as we go through more rounds. Whenever the buffer reaches its threshold number of items (the threshold is fixed throughout the algorithm), each element is removed from it with probability 0.5. When we reach the end of the stream, our estimate of the number of distinct items is |\mathcal{X}| / p.

Here is the theorem for this algorithm:

Theorem. For any data stream \mathcal{A} and any 0 < \epsilon, \delta < 1, the algorithm above outputs an (\epsilon, \delta)-approximation of \mathsf{F}_0(\mathcal{A}). It uses O(\frac{1}{\epsilon^2} \cdot \log n \cdot (\log m + \log \frac{1}{\delta})) space in the worst case.

The proof is not too long (about 3 pages) and, while it requires some restatement of the problem via alternative algorithms, the most advanced tool it uses is Chernoff’s bound.

To obtain an (\epsilon, \delta)-approximation for m items, the algorithm requires a buffer of size \left\lceil \frac{12}{\epsilon^2} \log \left( \frac{8m}{\delta}\right) \right\rceil. This grows logarithmically with m. If we want our approximation to be twice as accurate (halving \epsilon), we need the buffer to be 4 times as large. In practice the buffer can probably be much smaller: the authors are only interested in big-O type dependencies and so made the constants bigger wherever it made the proof simpler. (It’s worth noting that the algorithm gives the exact number of distinct items if the buffer happens to be larger than the true number of distinct items.)
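
To get a feel for the formula, here is a quick R computation of the required buffer size for a hypothetical stream (the values of m, \epsilon and \delta are arbitrary, and the natural logarithm is assumed):

buffer_size <- function(m, eps, delta) ceiling(12 / eps^2 * log(8 * m / delta))

buffer_size(m = 1e6, eps = 0.10, delta = 0.01)  # about 24,600 slots
buffer_size(m = 1e6, eps = 0.05, delta = 0.01)  # about 98,400 slots: halving eps quadruples the buffer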

Simulations

This algorithm is really easy to implement. Here is a quick implementation in R (definitely can be optimized further, just coded it up quickly for prototyping purposes):

# threshold: number of elements buffer can hold
# trajectory: if TRUE, return the estimate after every item in stream (default FALSE)
estimate_distinct_elements <- function(data_stream, threshold, trajectory = FALSE) {
  p <- 1
  m <- length(data_stream)
  buffer <- c()
  hit_fail_event <- FALSE
  estimates <- rep(NA, m)
  for (i in 1:m) {
    # Remove the current item (if present), then re-add it with probability p
    buffer <- setdiff(buffer, data_stream[i])
    if (runif(1) < p) buffer <- c(buffer, data_stream[i])
    if (length(buffer) == threshold) {
      # Buffer full: keep each element independently with probability 0.5, then halve p
      buffer <- buffer[runif(threshold) < 0.5]
      p <- p / 2
      if (length(buffer) == threshold) {  # fail event: no element was removed
        hit_fail_event <- TRUE
      }
    }
    # Running estimate of the distinct count (-1 once the fail event has occurred)
    estimates[i] <- ifelse(hit_fail_event, -1, length(buffer) / p)
  }
  
  if (trajectory) {
    return(estimates)
  } else {
    return(estimates[m])
  }
}

Let’s take this for a run with some simulations. All simulation code can be found here. For a start, let’s generate a stream of length 100, where each item is an integer from 1 to 100 (inclusive). For the particular data stream we generated, there were 60 distinct elements. This means that the naive solution would need a buffer of length 60.
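
Here is a sketch of how this first simulation might be set up using the function above (the seed is an arbitrary choice, so the exact counts may differ from those reported below):

set.seed(42)  # arbitrary seed
stream <- sample(1:100, size = 100, replace = TRUE)
length(unique(stream))  # true number of distinct elements

# 1000 runs of the algorithm with buffer length 30
estimates <- replicate(1000, estimate_distinct_elements(stream, threshold = 30))
summary(estimates)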

First, we run the algorithm 1000 times with buffer length 30, half of what the naive solution needed. The algorithm never hit the “fail” event, and here are some summary statistics of the distinct count estimate:

#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#  34.00   52.00   56.00   60.02   68.00  104.00 

The mean and median are pretty close to the true value! Here is a histogram of the estimates:

It’s worth noting that while the estimate is generally pretty good (50% chance of being within 8 of the true number), it can be wildly off as well!

Let’s rerun the simulation for the same data stream but with buffer length 15 (a quarter of the true number of distinct items). Again the algorithm never hit the “fail” event. The summary statistics and histogram are below, and as one might expect, while the estimates are still centered at the right place, there is more variance in the estimate. (The histograms might look similar, but note the different scales on the x-axis.)

#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   8.00   48.00   56.00   59.81   72.00  144.00 

Since this is a streaming algorithm, it provides an estimate of the number of distinct items after inspecting each item. Let’s look at how good the estimate is over time. Here are the plots for 10 estimate trajectories (true count in blue) for buffer length = 30 and buffer length = 15. Notice that the algorithm gives an exact estimate until the buffer is first filled. As expected, larger buffer length tends to give better estimates.

Let’s do one final simulation: Let’s look at a stream of length 10,000 with each element being a random integer from 1 to 4,000 (inclusive). For our data, the true number of distinct elements is 3,686.

Here is the plot of 10 trajectory estimates with buffer length 500 (about 14% of the true number of uniques) as well as the relative error of the estimate:

That’s not too bad! The estimates are pretty much within 10% of the true values. How about a much smaller buffer length of 50 (1.4% of the true number of uniques):

There’s much wider variation in the estimates now, probably too much to be useful for any real-world application.

References:

  1. Chakraborty, S., et al. (2023). Distinct Elements in Streams: An Algorithm for the (Text) Book.
  2. Nadis, S. (2024). Computer Scientists Invent an Efficient New Way to Count.

Vectorization and Kronecker products

If \mathbf{A} \in \mathbb{R}^{m \times n} and \mathbf{B} \in \mathbb{R}^{p \times q}, the Kronecker product \mathbf{A} \otimes \mathbf{B} is the pm \times qn block matrix defined by

\begin{aligned} \mathbf{A} \otimes \mathbf{B} = \begin{pmatrix} a_{11}\mathbf{B} & \dots & a_{1n}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{m1}\mathbf{B} & \dots & a_{mn}\mathbf{B} \end{pmatrix}. \end{aligned}

For a matrix \mathbf{A} \in \mathbb{R}^{m \times n}, the vectorization of \mathbf{A} is the mn \times 1 column vector obtained by stacking \mathbf{A}‘s columns on top of each other:

\begin{aligned} \text{vec}(\mathbf{A}) = \begin{pmatrix} a_{11} & \dots & a_{m1} & \dots & a_{1n} & \dots & a_{mn} \end{pmatrix}^\top. \end{aligned}

It turns out that vectorization and Kronecker products play nicely together as in the following proposition:

Proposition. If \mathbf{A} \in \mathbb{R}^{p \times q}, \mathbf{X} \in \mathbb{R}^{q \times m} and \mathbf{B} \in \mathbb{R}^{m \times n}, then

\begin{aligned} \text{vec}(\mathbf{AXB}) = (\mathbf{B}^\top \otimes \mathbf{A}) \text{vec}(\mathbf{X}). \end{aligned}

Proof: This proof is taken from Reference 1. Let \boldsymbol{b}_k denote the kth column of \mathbf{B} and \boldsymbol{x}_i the ith column of \mathbf{X}. Let’s compute the kth column of \mathbf{AXB}:

\begin{aligned} (\mathbf{AXB})_k &= \mathbf{A}(\mathbf{X}\boldsymbol{b}_k ) = \mathbf{A} \sum_{i=1}^m b_{ik} \boldsymbol{x}_i \\  &= \begin{pmatrix} b_{1k}\mathbf{A} & b_{2k}\mathbf{A} & \dots & b_{mk}\mathbf{A} \end{pmatrix} \begin{pmatrix} \boldsymbol{x}_1 \\ \boldsymbol{x}_2 \\ \vdots \\ \boldsymbol{x}_m \end{pmatrix} \\  &= (\boldsymbol{b}_k^\top \otimes \mathbf{A}) \text{vec}(\mathbf{X}). \end{aligned}

Stacking the columns of \mathbf{AXB} on top of each other:

\begin{aligned} \text{vec}(\mathbf{AXB}) = \begin{pmatrix} \boldsymbol{b}_1^\top \otimes \mathbf{A} \\ \vdots \\ \boldsymbol{b}_n^\top \otimes \mathbf{A} \end{pmatrix} \text{vec}(\mathbf{X}) = (\mathbf{B}^\top \otimes \mathbf{A}) \text{vec}(\mathbf{X}),  \end{aligned}

which concludes the proof.

Here is one nice application of the proposition above. Consider the matrix equation \mathbf{AXB} = \mathbf{C}. Vectorizing both sides (also known as the “vec trick”), we get

\begin{aligned} \text{vec}(\mathbf{C}) =\text{vec}(\mathbf{AXB}) = (\mathbf{B}^\top \otimes \mathbf{A}) \text{vec}(\mathbf{X}). \end{aligned}

It can be shown that the Kronecker product \mathbf{B}^\top \otimes \mathbf{A} is invertible if and only if \mathbf{A} and \mathbf{B}^\top (or equivalently \mathbf{B}) are invertible. Thus, it follows that \mathbf{AXB} = \mathbf{C} has a unique solution if and only if \mathbf{A} and \mathbf{B} are invertible.

Here is a less trivial example (from commenter Jarle Tufto): consider the matrix equation \mathbf{X} = \mathbf{AXB} + \mathbf{C}. By applying the vec trick,

\begin{aligned} \text{vec}(\mathbf{X}) &= (\mathbf{B}^\top \otimes \mathbf{A}) \text{vec}(\mathbf{X}) + \text{vec}(\mathbf{C}), \\  (\mathbf{I} - \mathbf{B}^\top \otimes \mathbf{A}) \text{vec}(\mathbf{X}) &= \text{vec}(\mathbf{C}), \\  \text{vec}(\mathbf{X}) &= (\mathbf{I} - \mathbf{B}^\top \otimes \mathbf{A})^{-1}\text{vec}(\mathbf{C}), \end{aligned}

assuming \mathbf{I} - \mathbf{B}^\top \otimes \mathbf{A} is invertible. There are a few more examples of applying this trick to matrix equations in Reference 3.
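
Here is a quick numerical check of the proposition and of the \mathbf{X} = \mathbf{AXB} + \mathbf{C} example in R (the dimensions and random matrices are arbitrary):

set.seed(1)
p <- 3; q <- 4; m <- 5; n <- 2
A <- matrix(rnorm(p * q), p, q)
X <- matrix(rnorm(q * m), q, m)
B <- matrix(rnorm(m * n), m, n)

# vec(AXB) should equal (B^T kronecker A) vec(X); the difference should be ~ 0
max(abs(as.vector(A %*% X %*% B) - kronecker(t(B), A) %*% as.vector(X)))

# Solving X = AXB + C via the vec trick (square case; matrices scaled down so that
# I - B^T kronecker A is invertible)
A2 <- matrix(rnorm(9), 3, 3) / 10
B2 <- matrix(rnorm(9), 3, 3) / 10
C2 <- matrix(rnorm(9), 3, 3)
X2 <- matrix(solve(diag(9) - kronecker(t(B2), A2), as.vector(C2)), 3, 3)
max(abs(X2 - (A2 %*% X2 %*% B2 + C2)))  # should be ~ 0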

References:

  1. Kronecker Product and the vec Operator.
  2. Wikipedia. Kronecker product.
  3. Zhang, T. MATH 470 Independent Study in Matrix Theory: The Kronecker Product.

The mixed-product property of the Kronecker product

If \mathbf{A} \in \mathbb{R}^{m \times n} and \mathbf{B} \in \mathbb{R}^{p \times q}, the Kronecker product \mathbf{A} \otimes \mathbf{B} is the pm \times qn block matrix defined by

\begin{aligned} \mathbf{A} \otimes \mathbf{B} = \begin{pmatrix} a_{11}\mathbf{B} & \dots & a_{1n}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{m1}\mathbf{B} & \dots & a_{mn}\mathbf{B} \end{pmatrix}. \end{aligned}

See the Wikipedia article (Reference 1) for a full list of properties of the Kronecker product. This post focuses on the mixed-product property of the Kronecker product. It’s a nice property that is often used to prove other properties of the Kronecker product. Here it is:

Proposition. For matrices \mathbf{A} \in \mathbb{R}^{m \times p}, \mathbf{B} \in \mathbb{R}^{m \times q}, \mathbf{C} \in \mathbb{R}^{p \times n}, \mathbf{D} \in \mathbb{R}^{q \times n}, we have

\begin{aligned} (\mathbf{A} \otimes \mathbf{B})(\mathbf{C} \otimes \mathbf{D}) = (\mathbf{A}\mathbf{C}) \otimes (\mathbf{B}\mathbf{D}). \end{aligned}

This is easy to prove if we use matrix multiplication for block matrices. We can write the RHS as

\begin{aligned} (\mathbf{A}\mathbf{C}) \otimes (\mathbf{B}\mathbf{D}) &= \begin{pmatrix} (\mathbf{A}\mathbf{C})_{11}(\mathbf{B}\mathbf{D}) & \dots & (\mathbf{A}\mathbf{C})_{1n}(\mathbf{B}\mathbf{D}) \\ \vdots & \ddots & \vdots \\ (\mathbf{A}\mathbf{C})_{m1}(\mathbf{B}\mathbf{D}) & \dots & (\mathbf{A}\mathbf{C})_{mn}(\mathbf{B}\mathbf{D}) \end{pmatrix}. \end{aligned}

The LHS can be expanded as

\begin{aligned} (\mathbf{A} \otimes \mathbf{B})(\mathbf{C} \otimes \mathbf{D}) &= \begin{pmatrix} a_{11}\mathbf{B} & \dots & a_{1p}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{m1}\mathbf{B} & \dots & a_{mp}\mathbf{B} \end{pmatrix} \begin{pmatrix} c_{11}\mathbf{D} & \dots & c_{1n}\mathbf{D} \\ \vdots & \ddots & \vdots \\ c_{p1}\mathbf{D} & \dots & c_{pn}\mathbf{D} \end{pmatrix}. \end{aligned}

Using block matrix multiplication, the ijth block of the LHS is

\begin{aligned} (\text{LHS})_{ij} &= \sum_{k=1}^p (a_{ik}\mathbf{B}) (c_{kj}\mathbf{D} ) \\  &= \sum_{k=1}^p (a_{ik}c_{kj}) (\mathbf{BD}) \\  &= (\mathbf{AC})_{ij} (\mathbf{BD}) \\  &= (\text{RHS})_{ij}. \end{aligned}
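
As a quick sanity check, here is a numerical verification in R (the dimensions are arbitrary but consistent with the proposition):

set.seed(1)
A <- matrix(rnorm(6), 2, 3); C <- matrix(rnorm(12), 3, 4)
B <- matrix(rnorm(10), 2, 5); D <- matrix(rnorm(20), 5, 4)

# (A kronecker B)(C kronecker D) should equal (AC) kronecker (BD); difference ~ 0
max(abs(kronecker(A, B) %*% kronecker(C, D) - kronecker(A %*% C, B %*% D)))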

The mixed-product property can be used to prove many other properties. Here’s one corollary:

Corollary. For square matrices \mathbf{A} \in \mathbb{R}^{m \times m} and \mathbf{B} \in \mathbb{R}^{m \times m} and positive integer k, we have

\begin{aligned} (\mathbf{A} \otimes \mathbf{B})^k = \mathbf{A}^k \otimes \mathbf{B}^k. \end{aligned}

Proof: Proceed by induction on k. The case of k = 1 is trivial. Assume the statement is true for some k \geq 1. Replacing \mathbf{A}, \mathbf{B}, \mathbf{C} and \mathbf{D} in the mixed-product property with \mathbf{A}^k, \mathbf{B}^k, \mathbf{A} and \mathbf{B}, we get

\begin{aligned} (\mathbf{A}^k \otimes \mathbf{B}^k)(\mathbf{A} \otimes \mathbf{B}) &= (\mathbf{A}^k\mathbf{A}) \otimes (\mathbf{B}^k\mathbf{B}), \\  (\mathbf{A} \otimes \mathbf{B})^k(\mathbf{A} \otimes \mathbf{B}) &= \mathbf{A}^{k+1} \otimes \mathbf{B}^{k+1}, \end{aligned}

which proves the induction step.

References:

  1. Wikipedia. Kronecker product.

Rough approximations to move between logit and probability scales

The logit and expit functions are commonly used to move between the probability scale [0, 1] and the whole real number line, especially when working with supervised learning models with binary outcomes. These functions are very easy to define; here is an implementation in R:

logit <- function(p) {
  log(p / (1 - p))
}

expit <- function(x) {
  exp(x) / (1 + exp(x))
}

When working in this space, it’s helpful to have some approximations to move between the logit and probability scales. (As an analogy, it is helpful to know that for a normal distribution, the interval \pm 2 standard deviations around the mean covers about 95% of the distribution, while \pm 3 standard deviations covers about 99%.)

Here are some rough approximations to help you estimate probabilities from logit scores and vice versa:

Logit    Probability
  -7        0.001
  -5        0.01
  -3        0.05
  -2        0.1
  -1        0.25
   0        0.5
   1        0.75
   2        0.9
   3        0.95
   5        0.99
   7        0.999
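
To see how rough these approximations are, we can compute the exact values with the expit function defined above:

round(expit(c(-7, -5, -3, -2, -1, 0, 1, 2, 3, 5, 7)), 3)
# 0.001 0.007 0.047 0.119 0.269 0.500 0.731 0.881 0.953 0.993 0.999

Most entries in the table are within about 0.02 of the exact values.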

What is the Plackett-Luce model?

The Plackett-Luce model, attributed to Robert L. Plackett (Reference 1) and R. Duncan Luce (Reference 2), is a model for rankings data. In this post, we first describe the simpler Bradley-Terry model, then introduce the Plackett-Luce model as a generalization of the Bradley-Terry model.

The Bradley-Terry model

The Bradley-Terry model is a model that allows us to make pairwise comparisons between items. Given two items i and j, let i \succ j denote “i is preferred to j”. The Bradley-Terry model assumes that pairwise preferences are generated according to the following probability model: there are positive numbers \alpha_i > 0 and \alpha_j > 0 associated with items i and j respectively, and

\begin{aligned} \mathbb{P} \{ i \succ j \} = \dfrac{\alpha_i}{\alpha_i + \alpha_j}. \end{aligned}

There are extensions to the Bradley-Terry model that allow for ties, home-field advantage and specific pairwise interaction terms; see for e.g. Davidson (1970) (Reference 3).

The Plackett-Luce model

Whereas the Bradley-Terry model works with pairwise preferences, the Plackett-Luce model works with listwise preferences. Let’s say we have some items i_1, \dots, i_J, and we are given the ranked preference list i_1 \succ i_2 \succ \dots \succ i_J. The Plackett-Luce model assumes that these preferences were generated according to a sequence of choices: the top-ranked item is chosen from the pool with some probability, then the second-ranked item is independently chosen from the remaining pool of items with some probability, and so on. Each item i is associated with a positive real number \alpha_i. At each step, if S denotes the remaining items, then the probability of choosing item i from this pool of items is assumed to be

\begin{aligned} \mathbb{P} \{ \text{choosing } i \text{ from } S \} = \dfrac{\alpha_i}{\sum_{j \in S} \alpha_j}. \end{aligned}

This is probably easier to understand via an example. Assume that we have 3 items, and the preference we are given is i_3 \succ i_1 \succ i_2. The probability associated with this ranked list is

\begin{aligned} \mathbb{P} \{ i_3 \succ i_1 \succ i_2 \} &= \mathbb{P} \{ \text{choosing } i_3 \text{ from } \{i_1, i_2, i_3 \} \} \\  &\qquad\cdot \mathbb{P} \{ \text{choosing } i_1 \text{ from } \{i_1, i_2 \} \} \\  &\qquad \cdot \mathbb{P} \{ \text{choosing } i_2 \text{ from } \{i_2 \} \} \\  &= \dfrac{\alpha_{i_3}}{\alpha_{i_1} + \alpha_{i_2} + \alpha_{i_3}} \cdot \dfrac{\alpha_{i_1}}{\alpha_{i_1} + \alpha_{i_2}} \cdot \dfrac{\alpha_{i_2}}{\alpha_{i_2}}. \end{aligned}

If we only consider lists of length 2, the Plackett-Luce model becomes the Bradley-Terry model.
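
Here is a small R function sketching this probability computation; the item worths \alpha are passed in directly, and the numbers in the example are arbitrary:

# Probability of a full ranking under the Plackett-Luce model.
# `ranking` lists item indices from most to least preferred; `alpha` holds the item worths.
plackett_luce_prob <- function(ranking, alpha) {
  remaining <- ranking
  prob <- 1
  for (item in ranking) {
    prob <- prob * alpha[item] / sum(alpha[remaining])
    remaining <- setdiff(remaining, item)
  }
  prob
}

# Example: 3 items with worths 1, 2 and 3; probability of the ranking 3 > 1 > 2
plackett_luce_prob(c(3, 1, 2), alpha = c(1, 2, 3))  # = (3/6) * (1/3) * (2/2) = 1/6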

How to fit a Bradley-Terry model (and the Plackett-Luce model by extension)

For both the Bradley-Terry and Plackett-Luce models, the question is how to determine the \alpha_i‘s for the items. This boils down to modeling the \alpha_i as a function of data/features associated with item i. Assume that each item i has a feature vector x_i. We can write the model generically as

\begin{aligned} \alpha_i = f_\theta(x_i), \end{aligned}

where \theta \in \Theta for some parameter space \Theta. We then find the “optimal” \theta based on some framework, most commonly maximum likelihood estimation (MLE).

Let’s write this out for the Bradley-Terry model for a specific model family (details with code in this previous blog post). Assume that \log \alpha_i is a linear combination of the features x_i, and the only features we consider are the one-hot encoding features: “is this item j?” for all items j. In other words, we assume

\begin{aligned} \alpha_i = \exp \left( \sum_j \beta_j 1\{ i = j \} \right) = \exp (\beta_i), \end{aligned}

and our job is to estimate the \beta_i. If our dataset D consists of triplets of the form (i, j, i \succ j), then the likelihood function is

\begin{aligned} L(D; \beta) = \prod_{(i, j, i \succ j) \in D} \dfrac{\alpha_i}{\alpha_i + \alpha_j} = \prod_{(i, j, i \succ j) \in D} \dfrac{\exp (\beta_i)}{\exp(\beta_i) + \exp(\beta_j)} \end{aligned}

and the MLE is given by \hat\beta = \text{argmax}_\beta \; L(D; \beta).

Practically, the PlackettLuce R package fits the Plackett-Luce model, and is able to handle generalizations such as ties and partial rankings (see the package vignette, Reference 4, for details).

References:

  1. Plackett, R. L. (1975). “The Analysis of Permutations.”
  2. Luce, R. D. (1959). Individual choice behavior.
  3. Davidson, R. R. (1970). “On extending the bradley-terry model to accommodate ties in paired comparison experiments.”
  4. Turner, H., et al. (2023). Introduction to PlackettLuce.