# What is the Tukey loss function?

## The Tukey loss function

The Tukey loss function, also known as Tukey’s biweight loss function, is a loss function that is used in robust statistics. Tukey’s loss is similar to Huber loss in that it demonstrates quadratic behavior near the origin. However, it is even more insensitive to outliers because the loss incurred by large residuals is constant, rather than scaling linearly as it would for the Huber loss.

The loss function is defined by the formula

$$
\begin{aligned}
\ell(r) = \begin{cases}
\frac{c^2}{6} \left(1 - \left[ 1 - \left( \frac{r}{c}\right)^2 \right]^3 \right) &\text{if } |r| \leq c, \\
\frac{c^2}{6} &\text{otherwise}.
\end{cases}
\end{aligned}
$$

In the above, I use $r$ as the argument to the function to represent “residual”, while $c$ is a positive parameter that the user has to choose. A common choice of this parameter is $c = 4.685$: Reference 1 notes that this value gives approximately 95% asymptotic statistical efficiency relative to ordinary least squares when the true errors have the standard normal distribution.
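As a rough numerical check of the 95% figure, here is a minimal sketch. It assumes the standard formula for the asymptotic efficiency of an M-estimator relative to least squares under standard normal errors, $(E[\ell''(Z)])^2 / E[\ell'(Z)^2]$, with residuals already on the standard normal scale. The helper names `tukey_psi`, `tukey_psi_prime` and `tukey_efficiency` are my own and do not come from Reference 1.

```r
# Derivative of the Tukey loss
tukey_psi <- function(r, c) {
  ifelse(abs(r) <= c, r * (1 - (r / c)^2)^2, 0)
}

# Second derivative of the Tukey loss: (1 - u) * (1 - 5u), where u = (r / c)^2
tukey_psi_prime <- function(r, c) {
  u <- (r / c)^2
  ifelse(abs(r) <= c, (1 - u) * (1 - 5 * u), 0)
}

# Efficiency under standard normal errors (assumed formula, see text above)
tukey_efficiency <- function(c) {
  # both integrands are zero outside [-c, c], so integrate over that interval only
  e_psi_prime <- integrate(function(z) tukey_psi_prime(z, c) * dnorm(z), -c, c)$value
  e_psi_sq <- integrate(function(z) tukey_psi(z, c)^2 * dnorm(z), -c, c)$value
  e_psi_prime^2 / e_psi_sq
}

eff <- tukey_efficiency(4.685)
print(eff)  # should come out close to 0.95 if the claim holds
```

Calling `tukey_efficiency()` with other values of $c$ shows the trade-off: a smaller $c$ rejects outliers more aggressively but gives up efficiency when the errors really are normal.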

You may be wondering why the loss function has a somewhat unusual constant of $c^2 / 6$ out front: it’s because this results in a nicer expression for the derivative of the loss function:

$$
\begin{aligned}
\ell'(r) = \begin{cases}
r \left[ 1 - \left( \frac{r}{c}\right)^2 \right]^2 &\text{if } |r| \leq c, \\
0 &\text{otherwise}.
\end{cases}
\end{aligned}
$$
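To see where the $c^2 / 6$ goes, here is the chain-rule step for $|r| \leq c$, written out as a short worked derivation:

$$
\begin{aligned}
\frac{d}{dr} \left\{ \frac{c^2}{6} \left(1 - \left[ 1 - \left( \frac{r}{c}\right)^2 \right]^3 \right) \right\}
&= -\frac{c^2}{6} \cdot 3 \left[ 1 - \left( \frac{r}{c}\right)^2 \right]^2 \cdot \left( -\frac{2r}{c^2} \right) \\
&= r \left[ 1 - \left( \frac{r}{c}\right)^2 \right]^2 .
\end{aligned}
$$

The factor $3 \cdot 2 = 6$ from differentiating cancels the 6 in the constant, and the $c^2$ terms cancel as well. At $|r| = c$ the first branch of the loss equals $c^2 / 6$ and its derivative equals 0, so both the loss and its derivative are continuous where the two branches meet.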

In the field of robust statistics, the derivative of the loss function is often of more interest than the loss function itself. In this field, it is common to denote the loss function and its derivative by the symbols $\rho$ and $\psi$ respectively.

(Note: The term “Tukey biweight function” is slightly ambiguous and, as a commenter pointed out, it often refers to the derivative of the loss rather than the loss function itself. It’s better to be more specific, e.g. saying “Tukey biweight loss function” for the loss (as in this paper), and/or saying “Tukey biweight psi function” for the derivative of the loss (as in this implementation).)

## Plots of the Tukey loss function

Here is R code that computes the Tukey loss and its derivative:

    tukey_loss <- function(r, c) {
      ifelse(abs(r) <= c,
             c^2 / 6 * (1 - (1 - (r / c)^2)^3),
             c^2 / 6)
    }

    tukey_loss_derivative <- function(r, c) {
      ifelse(abs(r) <= c,
             r * (1 - (r / c)^2)^2,
             0)
    }
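As a quick sanity check (not part of the original code), the sketch below compares `tukey_loss_derivative` against a central finite difference of `tukey_loss`, and confirms that the loss is capped at $c^2 / 6$ for large residuals. The two functions are repeated so that the snippet runs on its own; the step size `h`, the grid `r_grid` and the value `c_val` are arbitrary choices of mine.

```r
tukey_loss <- function(r, c) {
  ifelse(abs(r) <= c, c^2 / 6 * (1 - (1 - (r / c)^2)^3), c^2 / 6)
}
tukey_loss_derivative <- function(r, c) {
  ifelse(abs(r) <= c, r * (1 - (r / c)^2)^2, 0)
}

c_val <- 4.685
r_grid <- seq(-6, 6, by = 0.25)
h <- 1e-5

# central finite-difference approximation of the derivative of the loss
numeric_deriv <- (tukey_loss(r_grid + h, c_val) - tukey_loss(r_grid - h, c_val)) / (2 * h)
max_abs_diff <- max(abs(numeric_deriv - tukey_loss_derivative(r_grid, c_val)))
print(max_abs_diff)  # should be tiny if the analytic derivative is right

# beyond |r| = c the loss stays at its maximum value c^2 / 6
large_r_loss <- tukey_loss(c(5, 10, 100), c_val)
print(large_r_loss)
print(c_val^2 / 6)
```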


Here are plots of the loss function and its derivative for a few values of the $c$ parameter:

    r <- seq(-6, 6, length.out = 301)
    # values of the c parameter to plot (not named `c`, to avoid confusion with base::c())
    c_values <- 1:3

    # plot of Tukey loss
    library(ggplot2)
    theme_set(theme_bw())
    # long-format data frame: one row per (r, c) pair
    loss_df <- data.frame(
      r = rep(r, times = length(c_values)),
      loss = unlist(lapply(c_values, function(x) tukey_loss(r, x))),
      c = rep(c_values, each = length(r))
    )

    ggplot(loss_df, aes(x = r, y = loss, col = factor(c))) +
      geom_line() +
      labs(title = "Plot of Tukey loss", y = "Tukey loss",
           col = "c") +
      theme(legend.position = "bottom")


    # plot of Tukey loss derivative
    loss_deriv_df <- data.frame(
      r = rep(r, times = length(c_values)),
      loss_deriv = unlist(lapply(c_values, function(x) tukey_loss_derivative(r, x))),
      c = rep(c_values, each = length(r))
    )

    ggplot(loss_deriv_df, aes(x = r, y = loss_deriv, col = factor(c))) +
      geom_line() +
      labs(title = "Plot of derivative of Tukey loss", y = "Derivative of Tukey loss",
           col = "c") +
      theme(legend.position = "bottom")


## Some history

As far as I could find, Tukey’s loss was proposed by Beaton & Tukey (1974) (Reference 2), but not in the form I presented above. Rather, they proposed weights to be used in an iteratively reweighted least squares (IRLS) procedure.
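To illustrate the reweighting idea, here is a minimal IRLS sketch for simple linear regression. It uses the weight function $w(r) = \ell'(r) / r = [1 - (r/c)^2]^2$ for $|r| \leq c$ (and 0 otherwise), which is what the derivative of the loss implies. Several things here are my own assumptions rather than Beaton & Tukey's procedure: starting from the OLS fit, rescaling the residuals by their MAD at every iteration, the stopping rule, and the names `tukey_weight` and `irls_tukey`.

```r
# Biweight weights: psi(r) / r
tukey_weight <- function(r, c = 4.685) {
  ifelse(abs(r) <= c, (1 - (r / c)^2)^2, 0)
}

# Hypothetical helper: IRLS for y = b0 + b1 * x with Tukey biweight weights
irls_tukey <- function(x, y, c = 4.685, max_iter = 50, tol = 1e-8) {
  X <- cbind(1, x)                                # design matrix with intercept
  beta <- solve(crossprod(X), crossprod(X, y))    # start from the OLS fit (assumption)
  for (iter in seq_len(max_iter)) {
    res <- as.vector(y - X %*% beta)
    s <- mad(res)                                 # robust scale estimate (assumption)
    w <- tukey_weight(res / s, c)
    # weighted least squares step: solve (X' W X) beta = X' W y
    beta_new <- solve(crossprod(X, w * X), crossprod(X, w * y))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  as.vector(beta)
}

# Simulated data: true line y = 2 + 3x, with every 10th point shifted up by 20
set.seed(1)
x <- seq(-1, 1, length.out = 100)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)
outliers <- seq(10, 100, by = 10)
y[outliers] <- y[outliers] + 20

ols_fit <- unname(coef(lm(y ~ x)))
robust_fit <- irls_tukey(x, y)
print(ols_fit)     # pulled towards the outliers
print(robust_fit)  # should sit much closer to (2, 3)
```

Because the weights drop to exactly zero once a scaled residual exceeds $c$, gross outliers are ignored entirely after the first reweighting step. The flip side is that the objective is non-convex, so the result can depend on the starting values.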

References:

1. Belagiannis, V., et al. (2015). Robust Optimization for Deep Regression.
2. Beaton, A. E., and Tukey, J. W. (1974). The Fitting of Power Series, Meaning Polynomials, Illustrated on Band-Spectroscopic Data.

## 3 thoughts on “What is the Tukey loss function?”

1. Reblogged this on 667 per centimeter : climate science, quantitative biology, statistics, and energy policy and commented:

I’d love to see what this does in various kinds of regression. It may be possible to set up some kind of iterative regression scheme, where a normal regression with uniform weights is first done, and then the residuals are used to define a set of alternative weights via the Tukey Loss Function. Then the weighted regression is done, producing another set of residuals, and a new set of weights is defined. This should (eventually) settle down.

2. Unacquainted readers may not only be interested in the appearance of c^2 / 6 in the formula, but perhaps more so why it is common practice that one sets c = 4.685? The literature suggests that the derivative is often what is called Tukey’s Biweight Function. From that perspective continuity conditions lead to the value of the function for |r| > c as being c^2/6.

• Thanks for this note. I’ve added a short note on whether the biweight function refers to the loss or the derivative.

As for setting c = 4.685, I had a short sentence on it: for that value, we can compare the robust regression method to standard OLS in OLS theory and give a statement on its relative (statistical) efficiency. But to be honest I’m not sure how important this result really is today. It’s probably better to choose c based on the problem itself (e.g. at what point should we consider the size of the error immaterial?).
