# What is a proper scoring rule?

What is a proper scoring rule?

In the realm of forecasting, a scoring rule is a way to measure how good a probabilistic forecast is.

Mathematically, if $\Omega$ is the set of possible outcomes, then a probabilistic forecast is a probability distribution on $\Omega$ (i.e. how likely each possible outcome is). A scoring rule $S$ (or a score) is a function that takes a probability distribution on $\Omega$ (denoted by $P$) and an outcome $\omega \in \Omega$ as input and returns a real number as output. If $P$ is the probabilistic forecast and $\omega$ is the actual outcome, then $S(P, \omega)$ is interpreted as the forecaster's reward if the score is positively oriented (larger is better), or as the forecaster's loss if the score is negatively oriented (smaller is better).

Assume that the true probability distribution is denoted by $Q$. For each probabilistic forecast $P$, we can define the expected score as $S(P, Q) = \mathbb{E}_Q [S(P, \omega)] = \int S(P, \omega) dQ(\omega).$

A positively oriented scoring rule is said to be proper if for all probability distributions $P$ and $Q$, we have $S(Q, Q) \geq S(P, Q).$

In other words, for a proper score, the forecaster maximizes the expected reward by forecasting the true distribution. A strictly proper score is a score such that equality above is achieved uniquely at $P = Q$. To maximize a strictly proper score, a forecaster has every incentive to give an “honest” forecast and has no incentive to “hedge”.
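To make the inequality $S(Q, Q) \geq S(P, Q)$ concrete, here is a minimal sketch for a categorical outcome. It uses the logarithmic score $S(P, i) = \log p_i$ as the example rule; the "true" distribution `q` and the two rival forecasts are made-up numbers chosen only for illustration.

```python
import math

def log_score(p, i):
    """Positively-oriented logarithmic score: log of the probability given to outcome i."""
    return math.log(p[i])

def expected_score(score, p, q):
    """Expected score S(P, Q) = sum_i q_i * S(P, i) for a categorical outcome."""
    return sum(q_i * score(p, i) for i, q_i in enumerate(q))

q = [0.7, 0.2, 0.1]  # assumed "true" distribution (illustrative numbers)

honest = expected_score(log_score, q, q)                          # forecast P = Q
hedged = expected_score(log_score, [0.5, 0.3, 0.2], q)            # flatter than Q
overconfident = expected_score(log_score, [0.9, 0.05, 0.05], q)   # sharper than Q

# The honest forecast should have the largest expected score.
print(honest, hedged, overconfident)
```

Both rival forecasts earn a lower expected score than the honest one, which is what propriety predicts. Two rivals do not prove the inequality for every $P$, of course; they only illustrate it.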

Why would you ever use an improper scoring rule?

Since score propriety (i.e. a score being proper) seems like such a basic requirement, why would anyone use an improper scoring rule? It turns out that there are some scores that have nice properties but are not proper.

For example, imagine that we want to compare scores against some baseline forecaster. Given some score $S$, we can convert it to a skill score through normalization: \begin{aligned} SS(P, \omega) = \dfrac{S(P_{baseline}, \omega) - S(P, \omega)}{S(P_{baseline}, \omega)} = 1 - \dfrac{S(P, \omega)}{S(P_{baseline}, \omega)}, \end{aligned}

where $P_{baseline}$ is the probabilistic forecast which the baseline makes. If $S$ is negatively-oriented (smaller is better) and if the scores it produces are always non-negative, then the associated skill score is positively-oriented and in the range $(-\infty, 1]$.

Skill scores seem reasonable as they give us a fixed upper bound to aspire to. However, in general skill scores are improper. (See Reference 2 for a proof, and Section 2.3 of Reference 1 for another normalization scheme.)
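A toy calculation shows how the normalization can break propriety. The sketch below is not Murphy's proof; it assumes a binary outcome, takes $S$ to be the squared error $(\omega - p)^2$ of the forecast probability (a negatively-oriented binary Brier score), and uses made-up values for the true probability and the baseline forecast. It then searches a grid for the forecast that maximizes the expected skill score.

```python
def brier_binary(p, outcome):
    """Negatively-oriented binary Brier score: squared error of the forecast probability p."""
    return (outcome - p) ** 2

def expected_skill_score(p, q, p_baseline):
    """E_Q[1 - S(P, w) / S(P_baseline, w)] for a binary outcome with P(w = 1) = q."""
    ss_if_1 = 1 - brier_binary(p, 1) / brier_binary(p_baseline, 1)
    ss_if_0 = 1 - brier_binary(p, 0) / brier_binary(p_baseline, 0)
    return q * ss_if_1 + (1 - q) * ss_if_0

q = 0.5           # assumed true probability of outcome 1 (illustrative)
p_baseline = 0.9  # assumed baseline forecast (illustrative)

# Grid search over forecasts p in {0, 0.001, ..., 1}.
grid = [k / 1000 for k in range(1001)]
best_p = max(grid, key=lambda p: expected_skill_score(p, q, p_baseline))

print(best_p)
print(expected_skill_score(best_p, q, p_baseline))
print(expected_skill_score(q, q, p_baseline))
```

In this setup the best forecast is far from the true probability of 0.5: dividing by the baseline's score weights the two outcomes unequally, so the forecaster is rewarded for shading the forecast toward the outcome on which the baseline happens to do well.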

Two other examples of improper scores that have been used are the naive linear score and mean squared error (MSE) (see Reference 3 for details). Note that mean squared error here is not the usual MSE we use for point forecasts: there is a more general definition for probabilistic forecasts.
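As a quick illustration of the first of these, the sketch below assumes the usual definition of the linear score for a categorical outcome, $S(P, i) = p_i$ (the probability assigned to the outcome that occurred). That definition is not spelled out in this post, and the numbers are made up.

```python
def linear_score(p, i):
    """Assumed linear score: the probability the forecast assigned to the realized outcome i."""
    return p[i]

def expected_score(score, p, q):
    """Expected score S(P, Q) = sum_i q_i * S(P, i) for a categorical outcome."""
    return sum(q_i * score(p, i) for i, q_i in enumerate(q))

q = [0.7, 0.2, 0.1]  # assumed "true" distribution (illustrative numbers)

honest = expected_score(linear_score, q, q)
all_on_mode = expected_score(linear_score, [1.0, 0.0, 0.0], q)

# Putting all mass on the most likely outcome beats the honest forecast.
print(honest, all_on_mode)
```

Under this score the forecaster does best by overstating confidence in the most likely outcome, which is exactly the kind of dishonest forecast a proper score is meant to discourage.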

Some examples of proper scoring rules

Reference 1 contains a number of examples of proper scoring rules. First, assume that the sample space for the response is categorical (without loss of generality, let it be $\{1, 2, \dots, J \}$), and let the probabilistic forecast be represented by $p_i = P(\omega = i)$. Here are 4 proper scoring rules for categorical variables (only 1-3 are strictly proper):

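1. The Brier score: $S(P, i) = -\displaystyle\sum_{j=1}^J (\delta_{ij} - p_j)^2$, where $\delta_{ij} = 1\{ i = j\}$. The minus sign makes the score positively oriented, matching the definition of propriety above; the Brier score is often reported without it, as a loss. (I wrote a previous post on the Brier score here.)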
2. The spherical score: For some parameter $\alpha > 1$, the (generalized) spherical score is defined as $S(P, i) = \dfrac{p_i^{\alpha - 1}}{(\sum_{j=1}^J p_j^\alpha)^{(\alpha - 1)/\alpha}}$. The traditional spherical score is the special case $\alpha = 2$.
3. The logarithmic score: $S(P, i) = \log p_i$.
4. The zero-one score: Let $M = \text{argmax}_j p_j$ be the set of modes of the probabilistic forecast. Then the zero-one score is defined as $S(P, i) = 1\{ i \in M \}$.
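The four categorical rules can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the Brier score is negated so that all four rules are positively oriented (larger is better), the zero-one score rewards the outcome when it is a most likely category of the forecast, and the true distribution `q`, the random seed, and the number of random rival forecasts are arbitrary choices.

```python
import math
import random

def brier(p, i):
    """Negated Brier score, so that larger is better."""
    return -sum(((1.0 if j == i else 0.0) - p_j) ** 2 for j, p_j in enumerate(p))

def spherical(p, i, alpha=2.0):
    """Generalized spherical score; alpha = 2 gives the traditional spherical score."""
    norm = sum(p_j ** alpha for p_j in p) ** ((alpha - 1) / alpha)
    return p[i] ** (alpha - 1) / norm

def logarithmic(p, i):
    """Logarithmic score: log of the probability given to outcome i."""
    return math.log(p[i])

def zero_one(p, i):
    """Zero-one score: 1 if outcome i is a mode of the forecast, else 0."""
    return 1.0 if p[i] == max(p) else 0.0

def expected_score(score, p, q):
    """Expected score S(P, Q) = sum_i q_i * S(P, i)."""
    return sum(q_i * score(p, i) for i, q_i in enumerate(q))

def random_forecast(rng, n):
    """A random probability vector with strictly positive entries."""
    w = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(w)
    return [w_i / total for w_i in w]

rng = random.Random(0)
q = [0.6, 0.3, 0.1]  # assumed "true" distribution (illustrative numbers)

results = {}
for score in (brier, spherical, logarithmic, zero_one):
    honest = expected_score(score, q, q)
    best_rival = max(expected_score(score, random_forecast(rng, 3), q) for _ in range(2000))
    # For a proper score, no rival forecast should beat the honest one.
    results[score.__name__] = honest >= best_rival - 1e-12
    print(score.__name__, honest, best_rival)
```

For the zero-one score, any rival forecast whose mode is the first category ties with the honest forecast, which is why that rule is proper but not strictly proper.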

There are similar examples for continuous responses. For simplicity, assume that the probabilistic forecast $P$ has a density $p$ w.r.t. Lebesgue measure. (See Reference 1 for a more mathematically rigorous description.) Define \begin{aligned} \| p \|_\alpha = \left( \int p(\omega)^\alpha d\omega \right)^{1/\alpha}. \end{aligned}

1. The quadratic score: $S(P, \omega) = 2p(\omega) - \|p\|_2^2$.
2. The pseudospherical score: For $\alpha > 1$, $S(P, \omega) = p(\omega)^{\alpha - 1} / \| p\|_\alpha^{\alpha - 1}$.
3. The logarithmic score: $S(P, \omega) = \log p(\omega)$. This can be viewed as the limiting case of the pseudospherical score as $\alpha \rightarrow 1$. It is widely used, and is also known as the predictive deviance or ignorance score.
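As a sketch of how the quadratic and logarithmic scores behave for continuous responses, the code below assumes normal forecasts and a standard normal "true" distribution, both choices made only for illustration. It uses the fact that $\|p\|_2^2 = 1 / (2 \sigma \sqrt{\pi})$ for a normal density with standard deviation $\sigma$, and approximates the expected score by a midpoint-rule integral.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def quadratic_score(x, mu, sigma):
    """Quadratic score 2 p(x) - ||p||_2^2 for a normal forecast density p."""
    return 2 * normal_pdf(x, mu, sigma) - 1 / (2 * sigma * math.sqrt(math.pi))

def log_score(x, mu, sigma):
    """Logarithmic score log p(x) for a normal forecast density p."""
    return math.log(normal_pdf(x, mu, sigma))

def expected_score(score, mu, sigma, lo=-10.0, hi=10.0, n=4000):
    """Midpoint-rule approximation of E_Q[S(P, w)] with Q = N(0, 1) assumed as the truth."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * h
        total += score(x, mu, sigma) * normal_pdf(x, 0.0, 1.0) * h
    return total

forecasts = {"honest": (0.0, 1.0), "shifted": (0.5, 1.0), "too_wide": (0.0, 2.0)}
expected = {
    score.__name__: {name: expected_score(score, mu, sigma) for name, (mu, sigma) in forecasts.items()}
    for score in (quadratic_score, log_score)
}
for name, values in expected.items():
    print(name, values)
```

For both rules the honest forecast N(0, 1) should come out ahead of the shifted and the overly wide forecasts.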

Proper scoring rules are also available when the probabilistic forecast $P$ does not have a density. Every probabilistic forecast of a real-valued outcome has a cumulative distribution function $F$, and we can construct scores from that. The continuous ranked probability score (CRPS) is one such score, defined as \begin{aligned} S(P, \omega) = -\int_{-\infty}^\infty [F(y) - 1\{ y \geq \omega \}]^2 dy. \end{aligned}

The CRPS corresponds to the integral of the Brier scores for the associated binary probability forecasts at all real-valued thresholds.
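The CRPS integral is easy to approximate numerically. The sketch below assumes a normal forecast and compares a midpoint-rule approximation of the integral with the well-known closed-form expression for the CRPS of a normal distribution. That closed form is a standard result that does not appear in this post, and the integration limits and grid size are arbitrary choices. Both functions return the positively-oriented version used above, so values are at most zero.

```python
import math

def normal_cdf(y, mu, sigma):
    """CDF of N(mu, sigma^2) at y."""
    return 0.5 * (1 + math.erf((y - mu) / (sigma * math.sqrt(2))))

def crps_numeric(omega, mu, sigma, half_width=12.0, n=20000):
    """S(P, omega) = -integral of [F(y) - 1{y >= omega}]^2 dy, by the midpoint rule."""
    lo = min(mu, omega) - half_width * sigma
    hi = max(mu, omega) + half_width * sigma
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        y = lo + (k + 0.5) * h
        step = 1.0 if y >= omega else 0.0
        total += (normal_cdf(y, mu, sigma) - step) ** 2 * h
    return -total

def crps_normal_closed_form(omega, mu, sigma):
    """Closed-form CRPS for a normal forecast (standard result), negated to match S above."""
    z = (omega - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return -sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))

omega, mu, sigma = 0.7, 0.0, 1.0  # illustrative outcome and forecast parameters
numeric = crps_numeric(omega, mu, sigma)
closed = crps_normal_closed_form(omega, mu, sigma)
print(numeric, closed)
```

The two values should agree up to the discretization error of the grid, which is dominated by the jump of the indicator at $y = \omega$.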

References:

1. Gneiting, T., and Raftery, A. E. (2007). Strictly proper scoring rules, prediction and estimation.
2. Murphy, A. H. (1973). Hedging and skill scores for probability forecasts.
3. Bröcker, J., and Smith, L. A. (2007). Scoring probabilistic forecasts: The importance of being proper.