# What is Andrew’s Sine?

## Notational set-up

Let’s say we have data $x_1, \dots, x_n \stackrel{i.i.d.}{\sim} P_\theta$ for some probability distribution $P_\theta$, and we want to use the data to estimate the parameter $\theta$. If our estimate $\hat\theta$ is the solution to a minimization problem of the form

\begin{aligned} \text{minimize}_\theta \quad \sum_{i=1}^n \rho (x_i ; \theta) \end{aligned}

for some function $\rho$, then $\hat\theta$ is called an M-estimator. In maximum likelihood statistics, we choose $\rho (x ; \theta) = - \log f(x ; \theta)$, where $f (\cdot ; \theta)$ is the probability density associated with $P_\theta$.
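For a quick sanity check (a standard example, not specific to this post): if $P_\theta = \mathcal{N}(\theta, 1)$, then

\begin{aligned} \rho (x ; \theta) = - \log f(x ; \theta) = \frac{1}{2} (x - \theta)^2 + \frac{1}{2} \log (2 \pi), \end{aligned}

so the M-estimation problem reduces to least squares and $\hat\theta = \bar{x}$, the sample mean.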

Define

$\psi (x ; \theta) = \dfrac{\partial \rho(x ; \theta)}{\partial \theta}.$

Sometimes, we prefer to think of $\hat\theta$ as the solution to the implicit equation

\begin{aligned} \sum_{i=1}^n \psi(x_i ; \theta) = 0. \end{aligned}

In the field of robust statistics, we want to choose $\rho$ and/or $\psi$ such that the solution to the problem above has some robustness properties. These are the functions being referred to when you come across the terms “rho function” or “psi function” in this field.
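To make this concrete, here is a small R sketch (my own illustrative example, not from the references below) showing that the sample median is the M-estimator corresponding to $\rho(x ; \theta) = |x - \theta|$, and that it shrugs off a gross outlier:

```r
# The median as an M-estimator: rho(x; theta) = |x - theta|
set.seed(1)
x <- c(rnorm(100), 50)  # 101 points, one gross outlier

# M-estimation objective: sum of rho over the data
rho_abs <- function(theta, x) sum(abs(x - theta))

# Minimize the objective numerically (the objective is convex in theta)
theta_hat <- optimize(rho_abs, interval = range(x), x = x)$minimum

# The minimizer matches the sample median, which is barely moved by the outlier
theta_hat
median(x)
```

The mean of this sample is dragged toward 50 by the outlier, while the M-estimator above stays near 0.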

## Andrew’s Sine

In this blog we’ve already come across one rho/psi function used in robust statistics: the Tukey loss function. Andrew’s Sine is another psi function that appears in robust statistics. It is defined by

\begin{aligned} \psi(x) = \begin{cases} \sin (x / c) &\text{if } |x| \leq c\pi, \\ 0 &\text{otherwise}, \end{cases} \end{aligned}

where $c$ is a user-defined parameter. The rho function implied by this choice is

\begin{aligned} \rho(x) = \begin{cases} c[1 - \cos (x / c)] &\text{if } |x| \leq c\pi, \\ 2c &\text{otherwise}. \end{cases} \end{aligned}

Here are plots of both the rho and psi functions for a few choices of $c$:
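The following R sketch implements both functions and produces the plots with base graphics (the function names `andrews_psi` and `andrews_rho` are my own, chosen to mirror the `tukey_loss` code elsewhere on this blog):

```r
# Andrew's Sine psi function and the rho function it implies
andrews_psi <- function(x, c) {
  ifelse(abs(x) <= c * pi, sin(x / c), 0)
}

andrews_rho <- function(x, c) {
  ifelse(abs(x) <= c * pi, c * (1 - cos(x / c)), 2 * c)
}

# Plot both for c = 1, 2, 3
x <- seq(-8, 8, length.out = 401)
matplot(x, sapply(1:3, function(cc) andrews_rho(x, cc)), type = "l",
        lty = 1, ylab = "rho(x)", main = "Andrew's Sine: rho for c = 1, 2, 3")
matplot(x, sapply(1:3, function(cc) andrews_psi(x, cc)), type = "l",
        lty = 1, ylab = "psi(x)", main = "Andrew's Sine: psi for c = 1, 2, 3")
```

Note that $\rho$ is continuous at $|x| = c\pi$, since $c[1 - \cos(\pi)] = 2c$, and that $\psi$ redescends all the way to zero, so sufficiently large residuals exert no influence at all on the estimate.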

## Some history

Andrew’s Sine is named after D. F. Andrews. The first mention of it I could find was in Andrews (1974) (Reference 3), but it appears to have been proposed earlier in Andrews et al. (1972) (Reference 4), for which I can’t find a copy.

## Choosing $c$

Reference 3 recommends using $c = 1.5$ or $c = 1.8$ without giving an explanation. Reference 5 suggests $c = 1.339$, noting that with this value of $c$, the corresponding M-estimator gives 95% efficiency at the normal distribution.

## References

1. Wolfram MathWorld. Andrew’s Sine.
2. Penn State Department of Statistics, STAT 501. 13.3 – Robust Regression Methods.
3. Andrews, D. F. (1974). A Robust Method for Multiple Linear Regression.
4. Andrews, D. F., et al. (1972). Robust Estimates of Location: Survey and Advances.
5. Young, D. S. (2017). Handbook of Regression Methods.

# What is the Tukey loss function?

## The Tukey loss function

The Tukey loss function, also known as Tukey’s biweight loss function, is a loss function used in robust statistics. Like the Huber loss, it is approximately quadratic near the origin. However, it is even more insensitive to outliers: the loss incurred by large residuals is constant, rather than growing linearly as it does for the Huber loss.

The loss function is defined by the formula

\begin{aligned} \ell (r) = \begin{cases} \frac{c^2}{6} \left(1 - \left[ 1 - \left( \frac{r}{c}\right)^2 \right]^3 \right) &\text{if } |r| \leq c, \\ \frac{c^2}{6} &\text{otherwise}. \end{cases} \end{aligned}

In the above, I use $r$ as the argument to the function to represent “residual”, while $c$ is a positive parameter that the user has to choose. A common choice of this parameter is $c = 4.685$: Reference 1 notes that this value results in approximately 95% asymptotic efficiency relative to ordinary least squares when the true errors follow the standard normal distribution.

You may be wondering why the loss function has a somewhat unusual constant of $c^2 / 6$ out front: it’s because this results in a nicer expression for the derivative of the loss function:

\begin{aligned} \ell' (r) = \begin{cases} r \left[ 1 - \left( \frac{r}{c}\right)^2 \right]^2 &\text{if } |r| \leq c, \\ 0 &\text{otherwise}. \end{cases} \end{aligned}

In the field of robust statistics, the derivative of the loss function is often of more interest than the loss function itself. In this field, it is common to denote the loss function and its derivative by the symbols $\rho$ and $\psi$ respectively.

(Note: The term “Tukey biweight function” is slightly ambiguous and, as a commenter pointed out, it often refers to the derivative of the loss rather than the loss function itself. It’s better to be more specific, e.g. saying “Tukey biweight loss function” for the loss (as in this paper), and/or saying “Tukey biweight psi function” for the derivative of the loss (as in this implementation).)

## Plots of the Tukey loss function

Here is R code that computes the Tukey loss and its derivative:

```r
tukey_loss <- function(r, c) {
  ifelse(abs(r) <= c,
         c^2 / 6 * (1 - (1 - (r / c)^2)^3),
         c^2 / 6)
}

tukey_loss_derivative <- function(r, c) {
  ifelse(abs(r) <= c,
         r * (1 - (r / c)^2)^2,
         0)
}
```


Here are plots of the loss function and its derivative for a few values of the $c$ parameter:

```r
r <- seq(-6, 6, length.out = 301)
c <- 1:3

# plot of tukey loss
library(ggplot2)
theme_set(theme_bw())
loss_df <- data.frame(
  r = rep(r, times = length(c)),
  loss = unlist(lapply(c, function(x) tukey_loss(r, x))),
  c = rep(c, each = length(r))
)

ggplot(loss_df, aes(x = r, y = loss, col = factor(c))) +
  geom_line() +
  labs(title = "Plot of Tukey loss", y = "Tukey loss",
       col = "c") +
  theme(legend.position = "bottom")

# plot of tukey loss derivative
loss_deriv_df <- data.frame(
  r = rep(r, times = length(c)),
  loss_deriv = unlist(lapply(c, function(x) tukey_loss_derivative(r, x))),
  c = rep(c, each = length(r))
)

ggplot(loss_deriv_df, aes(x = r, y = loss_deriv, col = factor(c))) +
  geom_line() +
  labs(title = "Plot of derivative of Tukey loss",
       y = "Derivative of Tukey loss", col = "c") +
  theme(legend.position = "bottom")
```


## Some history

According to what I could find, Tukey’s loss was proposed by Beaton & Tukey (1974) (Reference 2), but not in the form I presented above. Rather, they proposed weights to be used in an iterative reweighted least squares (IRLS) procedure.

## References

1. Belagiannis, V., et al. (2015). Robust Optimization for Deep Regression.
2. Beaton, A. E., and Tukey, J. W. (1974). The Fitting of Power Series, Meaning Polynomials, Illustrated on Band-Spectroscopic Data.

# Robit regression

Let’s say we have $n$ observations, and for each observation we have values for $p$ features, $x_i \in \mathbb{R}^p$, and a binary response $y_i \in \{0, 1\}$. We want to build a predictive model for $y$ using the features we have. The most commonly used model for this is logistic regression. If we define $p_i = \Pr \{ y_i = 1\}$, logistic regression models the relationship as

\begin{aligned} \text{logit}(p_i) = \log \left( \frac{p_i}{1-p_i} \right) = x_i^T \beta, \end{aligned}

where $\beta \in \mathbb{R}^p$ are the regression coefficients to be determined by the data.

Probit regression is another commonly used model for this setting. Instead of the link function being the logit, the link function is the inverse of the standard normal cumulative distribution function (CDF):

\begin{aligned} \Phi^{-1}(p_i) = x_i^T \beta, \end{aligned}

more commonly written as

\begin{aligned} p_i= \Phi(x_i^T \beta). \end{aligned}

## Latent variable interpretation

Probit regression has a latent variable interpretation as well. The latent variable $z_i$ is modeled as

\begin{aligned} z_i = x_i^T \beta + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1), \end{aligned}

and the response is simply the indicator variable $y_i = 1\{ z_i > 0 \}$. Logistic regression has a similar latent variable interpretation too: there, instead of the error being normally distributed, it is distributed according to the standard logistic distribution.
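The latent variable story can be checked directly by simulation. The R sketch below (my own illustration; the coefficient values are made up) generates data from the latent variable model with standard normal errors and verifies that a probit GLM recovers the coefficients:

```r
# Simulate from the latent variable model and fit probit regression
set.seed(42)
n <- 5000
x <- cbind(1, rnorm(n))          # intercept plus one feature
beta <- c(-0.5, 1.2)
z <- x %*% beta + rnorm(n)       # latent variable with N(0, 1) error
y <- as.integer(z > 0)           # observed binary response

fit <- glm(y ~ x[, 2], family = binomial(link = "probit"))
coef(fit)  # should be close to (-0.5, 1.2)
```

Swapping `rnorm(n)` for draws from the standard logistic distribution (`rlogis(n)`) and fitting with `link = "logit"` gives the analogous check for logistic regression.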

## Robit regression

Robit regression is a robust version of logistic/probit regression. It posits the same latent variable model. However, the error is distributed as a scaled $t$-distribution:

\begin{aligned} z_i = x_i^T \beta + \epsilon, \quad \epsilon \sim t_\nu \left( 0, \frac{\nu - 2}{\nu}\right), \end{aligned}

where $\nu > 2$ is the degrees-of-freedom parameter (can be estimated from the data) and the $t$ distribution is scaled so that its standard deviation is 1. Because the $t$-distribution has fatter tails than the normal distribution, the model allows for occasional large errors, and as a result the estimate for $\beta$ is less affected by outliers.

Robit regression is equivalent to probit regression in the limit $\nu \to \infty$, and is close to logistic regression when $\nu \approx 7$.
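Since the latent variable model implies $p_i = F(x_i^T \beta)$, where $F$ is the CDF of the scaled $t$-distribution, robit regression can be fit by direct maximum likelihood. Here is a minimal sketch with $\nu$ fixed at 7 (my own illustration, not code from the reference; the simulated data and variable names are made up):

```r
# Fit robit regression by maximizing the likelihood directly, with nu fixed
set.seed(42)
n <- 2000
x <- cbind(1, rnorm(n))
beta_true <- c(-0.5, 1.2)
nu <- 7
s <- sqrt((nu - 2) / nu)         # scale that makes the t error have sd 1
z <- x %*% beta_true + s * rt(n, df = nu)
y <- as.integer(z > 0)

# P(y = 1) = F(x^T beta), F = CDF of the scaled t distribution
robit_negloglik <- function(beta) {
  p <- pt((x %*% beta) / s, df = nu)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

fit <- optim(c(0, 0), robit_negloglik)
fit$par  # should be close to beta_true
```

In practice $\nu$ can also be estimated from the data, e.g. by profiling the likelihood over a grid of $\nu$ values.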

## References

1. Gelman, A., Hill, J., and Vehtari, A. (2020). Regression and other stories (Chapter 15).