What is Andrew’s Sine?

Notational set-up

Let’s say we have data x_1, \dots, x_n \stackrel{i.i.d.}{\sim} P_\theta for some probability distribution P_\theta, and we want to use the data to estimate the parameter \theta. If our estimate \hat\theta is the solution to a minimization problem of the form

\begin{aligned} \text{minimize}_\theta \quad \sum_{i=1}^n \rho (x_i ; \theta) \end{aligned}

for some function \rho, then \hat\theta is called an M-estimator. In maximum likelihood estimation, we choose \rho (x ; \theta) = - \log f(x ; \theta), where f (\cdot ; \theta) is the probability density associated with P_\theta.

Define

\psi (x ; \theta) = \dfrac{\partial \rho(x ; \theta)}{\partial \theta}.

Sometimes, we prefer to think of \hat\theta as the solution to the implicit equation

\begin{aligned} \sum_{i=1}^n \psi(x_i ; \theta) = 0.  \end{aligned}

In the field of robust statistics, we want to choose \rho and/or \psi such that the solution to the problem above has some robustness properties. These are the functions being referred to when you come across the terms “rho function” or “psi function” in this field.
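For a quick non-robust baseline, take \rho(x ; \theta) = (x - \theta)^2 / 2 (squared-error loss). Then

\begin{aligned} \psi(x ; \theta) = \theta - x, \qquad \sum_{i=1}^n \psi(x_i ; \theta) = 0 \;\Longleftrightarrow\; \hat\theta = \bar{x}, \end{aligned}

so the resulting M-estimator is just the sample mean, which is highly sensitive to outliers. Robust psi functions like the one below are designed to avoid this sensitivity.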

Andrew’s Sine

In this blog we’ve already come across one rho/psi function used in robust statistics: the Tukey loss function. Andrew’s Sine is another psi function that appears in robust statistics. It is defined by

\begin{aligned} \psi(x) = \begin{cases} \sin (x / c) &\text{if } |x| \leq c\pi, \\ 0 &\text{otherwise}, \end{cases} \end{aligned}

where c is a user-defined parameter. The rho function implied by this choice is

\begin{aligned} \rho(x) = \begin{cases} c[1 - \cos (x / c)] &\text{if } |x| \leq c\pi, \\ 2c &\text{otherwise}. \end{cases} \end{aligned}

Here are plots of both the rho and psi functions for a few choices of c:
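In case you want to reproduce plots like these yourself, here is a minimal R sketch in the spirit of the plotting code in the Tukey loss post below. The helper names andrews_psi and andrews_rho are my own, and the c values are the ones mentioned in the text.

# Andrew's Sine psi and rho functions (helper names are my own)
andrews_psi <- function(x, c) {
  ifelse(abs(x) <= c * pi, sin(x / c), 0)
}

andrews_rho <- function(x, c) {
  ifelse(abs(x) <= c * pi, c * (1 - cos(x / c)), 2 * c)
}

x <- seq(-6, 6, length.out = 301)
c <- c(1.339, 1.5, 1.8)

library(ggplot2)
theme_set(theme_bw())

psi_df <- data.frame(
  x   = rep(x, times = length(c)),
  psi = unlist(lapply(c, function(cc) andrews_psi(x, cc))),
  c   = rep(c, each = length(x))
)

ggplot(psi_df, aes(x = x, y = psi, col = factor(c))) +
  geom_line() +
  labs(title = "Andrew's Sine psi function", y = "psi", col = "c") +
  theme(legend.position = "bottom")

rho_df <- data.frame(
  x   = rep(x, times = length(c)),
  rho = unlist(lapply(c, function(cc) andrews_rho(x, cc))),
  c   = rep(c, each = length(x))
)

ggplot(rho_df, aes(x = x, y = rho, col = factor(c))) +
  geom_line() +
  labs(title = "Andrew's Sine rho function", y = "rho", col = "c") +
  theme(legend.position = "bottom")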

Some history

Andrew’s Sine is named after D. F. Andrews. The first mention of it I could find was in Andrews (1974) (Reference 3), but it appears to have been proposed first in Andrews et al. (1972) (Reference 4), for which I can’t find a copy.

Choosing c

Reference 3 recommends using c = 1.5 or c = 1.8 without giving an explanation. Reference 5 suggests c = 1.339, noting that with this value of c, the corresponding M-estimator gives 95% efficiency at the normal distribution.

References:

  1. Wolfram MathWorld. Andrew’s Sine.
  2. Penn State Department of Statistics, STAT 501. 13.3 – Robust Regression Methods.
  3. Andrews, D. F. (1974). A Robust Method for Multiple Linear Regression.
  4. Andrews, D. F., et al. (1972). Robust Estimates of Location: Survey and Advances.
  5. Young, D. S. (2017). Handbook of Regression Methods.

What is the Tukey loss function?

The Tukey loss function

The Tukey loss function, also known as Tukey’s biweight loss function, is a loss function that is used in robust statistics. Tukey’s loss is similar to Huber loss in that it demonstrates quadratic behavior near the origin. However, it is even more insensitive to outliers because the loss incurred by large residuals is constant, rather than scaling linearly as it would for the Huber loss.

The loss function is defined by the formula

\begin{aligned} \ell (r) = \begin{cases} \frac{c^2}{6} \left(1 - \left[ 1 - \left( \frac{r}{c}\right)^2 \right]^3 \right) &\text{if } |r| \leq c, \\ \frac{c^2}{6} &\text{otherwise}. \end{cases} \end{aligned}

In the above, I use r as the argument to the function to represent “residual”, while c is a positive parameter that the user has to choose. A common choice of this parameter is c = 4.685: Reference 1 notes that this value results in approximately 95% asymptotic statistical efficiency relative to ordinary least squares when the true errors follow the standard normal distribution.

You may be wondering why the loss function has a somewhat unusual constant of c^2 / 6 out front: it’s because this results in a nicer expression for the derivative of the loss function:

\begin{aligned} \ell' (r) = \begin{cases} r \left[ 1 - \left( \frac{r}{c}\right)^2 \right]^2 &\text{if } |r| \leq c, \\ 0 &\text{otherwise}. \end{cases} \end{aligned}

In the field of robust statistics, the derivative of the loss function is often of more interest than the loss function itself. In this field, it is common to denote the loss function and its derivative by the symbols \rho and \psi respectively.

(Note: The term “Tukey biweight function” is slightly ambiguous and as a commenter pointed out, it often refers to the derivative of the loss rather than the loss function itself. It’s better to be more specific, e.g. saying “Tukey biweight loss function” for the loss (as in this paper), and/or saying “Tukey biweight psi function” for the derivative of the loss (as in this implementation).)

Plots of the Tukey loss function

Here is R code that computes the Tukey loss and its derivative:

# Tukey loss: smooth and approximately quadratic near the origin,
# constant (c^2 / 6) for |r| > c
tukey_loss <- function(r, c) {
  ifelse(abs(r) <= c,
         c^2 / 6 * (1 - (1 - (r / c)^2)^3),
         c^2 / 6)
}

# Derivative (psi function) of the Tukey loss: identically zero for |r| > c
tukey_loss_derivative <- function(r, c) {
  ifelse(abs(r) <= c,
         r * (1 - (r / c)^2)^2,
         0)
}

Here are plots of the loss function and its derivative for a few values of the c parameter:

r <- seq(-6, 6, length.out = 301)
c <- 1:3

# plot of tukey loss
library(ggplot2)
theme_set(theme_bw())
loss_df <- data.frame(
  r = rep(r, times = length(c)),
  loss = unlist(lapply(c, function(x) tukey_loss(r, x))),
  c = rep(c, each = length(r))
)

ggplot(loss_df, aes(x = r, y = loss, col = factor(c))) +
  geom_line() +
  labs(title = "Plot of Tukey loss", y = "Tukey loss",
       col = "c") +
  theme(legend.position = "bottom")

# plot of tukey loss derivative
loss_deriv_df <- data.frame(
  r = rep(r, times = length(c)),
  loss_deriv = unlist(lapply(c, function(x) tukey_loss_derivative(r, x))),
  c = rep(c, each = length(r))
)

ggplot(loss_deriv_df, aes(x = r, y = loss_deriv, col = factor(c))) +
  geom_line() +
  labs(title = "Plot of derivative of Tukey loss", y = "Derivative of Tukey loss",
       col = "c") +
  theme(legend.position = "bottom")

Some history

According to what I could find, Tukey’s loss was proposed by Beaton & Tukey (1974) (Reference 2), but not in the form I presented above. Rather, they proposed weights to be used in an iteratively reweighted least squares (IRLS) procedure.
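To make the connection concrete: in IRLS the weight attached to a residual r is typically w(r) = \psi(r) / r, which for the Tukey psi function above gives the well-known “biweight”. Here is a small sketch (the function name is my own, and this is not necessarily the exact parameterization Beaton & Tukey used):

# IRLS weight implied by the Tukey psi function: w(r) = psi(r) / r
# (a sketch; not necessarily the exact parameterization in Beaton & Tukey 1974)
tukey_weight <- function(r, c) {
  ifelse(abs(r) <= c, (1 - (r / c)^2)^2, 0)
}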

References:

  1. Belagiannis, V., et al. (2015). Robust Optimization for Deep Regression.
  2. Beaton, A. E., and Tukey, J. W. (1974). The Fitting of Power Series, Meaning Polynomials, Illustrated on Band-Spectroscopic Data.

What is a Brier score?

The Brier score is a cost function (or loss function) that measures the accuracy of probabilistic predictions. Because it is a cost function, a lower Brier score indicates more accurate predictions while a higher Brier score indicates less accurate predictions. In its most common formulation, the best and worst possible Brier scores are 0 and 1 respectively.

Brier score for binary predictions

Assume that we have n binary outcomes o_1, \dots, o_n \in \{ 0, 1\} that we want to predict. Let our predictions be denoted by p_i = \text{Pr}(i\text{th outcome is } 1) for i = 1, \dots, n. The Brier score for these predictions is given by the formula

\begin{aligned} BS = \dfrac{1}{n}\sum_{i=1}^n (p_i - o_i)^2. \end{aligned}

The Brier score is only one of several loss functions one could use for binary classification. See this Wikipedia article for a longer list, and see this reference for a comparison between the Brier score and logistic loss.
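If you want to compute it yourself, the binary Brier score is just a mean of squared differences; here is a minimal R sketch (the function name brier_score is my own):

# Binary Brier score: mean squared difference between predicted
# probabilities and the observed 0/1 outcomes
brier_score <- function(p, o) {
  mean((p - o)^2)
}

brier_score(p = c(1, 0, 1), o = c(1, 0, 1))  # 0: perfect predictions
brier_score(p = c(0, 1, 0), o = c(1, 0, 1))  # 1: maximally wrong predictions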

Brier score for multi-category predictions

The Brier score can be generalized for multi-class predictions. Let K be the number of possible outcomes (e.g. if the possible outcomes are A, B and C, then K = 3). Without loss of generality, let the outcome classes be denoted by 1, 2, \dots, K. For case i, if the true outcome is class k, define

\begin{aligned} o_{ij} = \begin{cases} 1 &\text{if } j = k, \\ 0 &\text{otherwise.} \end{cases} \end{aligned}

Let p_{ij} = \text{Pr}(i\text{th outcome is class } j) denote the predicted probability of the ith outcome being in class j. Then the (multi-category) Brier score is given by

\begin{aligned} BS = \dfrac{1}{n}\sum_{i=1}^n \sum_{j = 1}^K (p_{ij} - o_{ij})^2. \end{aligned}

In this setting, the best possible Brier score is 0 (probability 1 for the correct class, probability 0 for everything else) while the worst possible Brier score is 2 (probability 1 for one of the wrong classes, probability 0 for everything else). Note that in the special case of K = 2 (the binary setting), the Brier score defined in this way is exactly twice that of the Brier score as defined in the previous section.
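Here is a sketch of the multi-category version in R, with the predictions and one-hot outcomes stored as n x K matrices (again, the function name is my own):

# Multi-category Brier score: p and o are n x K matrices; row i of p holds
# predicted class probabilities and row i of o is the one-hot true outcome
brier_score_multi <- function(p, o) {
  mean(rowSums((p - o)^2))
}

# K = 3 example: full confidence on a wrong class contributes the worst score of 2
p <- rbind(c(1, 0, 0), c(0, 1, 0))
o <- rbind(c(1, 0, 0), c(0, 0, 1))
brier_score_multi(p, o)  # (0 + 2) / 2 = 1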

Don’t use Brier scores for ordinal predictions

The Wikipedia article for Brier score notes that it is “inappropriate for ordinal variables which can take on three or more values.” This is because any misclassification is treated in the same way. For example, let’s say we are trying to predict whether a soccer team is going to score (i) 0-1 goals (“class 1”), (ii) 2-3 goals (“class 2”) or (iii) more than 3 goals (“class 3”). Let’s say we only want to predict one outcome which turned out to be class 1. Consider the following two predictions:

A. Class 2 with 100% probability.
B. Class 3 with 100% probability.

Because of the ordinal scale, we would prefer prediction A over prediction B. However, the Brier score for each of these predictions would be the same: 2.

Brier skill score

A noted shortcoming of the Brier score is that it doesn’t fare well when the classes are imbalanced. For example, suppose that we are trying to predict 10 outcomes, 9 of which are 1 (the remaining case being 0). Suppose that our model has a Brier score of 0.15. Is it a good model?

Even though the score sounds low, the model isn’t good at all. Consider the naive prediction of “always predict class 1 with 100% probability”: this has a Brier score of

\begin{aligned} \dfrac{1}{10}[9 \cdot (1-1)^2 + 1 \cdot (1-0)^2] = 0.1, \end{aligned}

which is better than our model!

For such settings, Brier skill scores are more meaningful in that they compare our model’s Brier score with that of some “reference forecast”. The formula for a model’s Brier skill score is

\begin{aligned} BSS = 1 - \dfrac{BS}{BS_{ref}}, \end{aligned}

where BS and BS_{ref} are the Brier scores of the model and the reference forecast respectively. The Brier skill score takes on values from -\infty to 1, with higher scores meaning more accurate predictions (compared with the reference). A BSS of 0 indicates the same accuracy as that of the reference.

The reference is often some naive prediction one can make, e.g. always predict the dominant class with 100% probability, or predict each class with its long-term proportion.
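Plugging in the numbers from the imbalanced example above (a quick sketch):

# Brier skill score: compares a model's Brier score to a reference forecast's
brier_skill_score <- function(bs, bs_ref) {
  1 - bs / bs_ref
}

# model with BS = 0.15 vs. the naive "always predict class 1" reference (BS = 0.1)
brier_skill_score(0.15, 0.1)  # -0.5: worse than the naive reference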

References:

  1. Brier, G. W. (1950). Verification of forecasts expressed in terms of probability.
  2. Wikipedia. Brier score.
  3. Statistics How To. Brier Score: Definition, Examples.

Loss functions for different types of data

Today I attended a very interesting talk by Madeleine Udell titled “Big data is low rank”. The slides for the talk are available here, and the paper which the slides are based on is available here.

This post, however, is not about the main topic of the talk but one of the slides from it. Madeleine had a very nice slide listing some of the common loss functions one might use for different types of data. I was familiar with the loss functions for real and boolean data, but not with those for other types of data. Here is the list for (my future) easy reference (it’s slide 13 of the slides linked above):