What is a proper scoring rule?
In the realm of forecasting, a scoring rule is a way to measure how good a probabilistic forecast is.
Mathematically, if $\Omega$ is the set of possible outcomes, then a probabilistic forecast is a probability distribution on $\Omega$ (i.e. how likely each possible outcome is). A scoring rule (or a score) is a function $S$ that takes a probability distribution on $\Omega$ (denoted by $P$) and an outcome $\omega \in \Omega$ as input and returns a real-valued number as output. If $P$ is the probabilistic forecast and $\omega$ is the actual outcome, then $S(P, \omega)$ is interpreted as the reward to the forecaster if the score is positively oriented, and as the loss if it is negatively oriented.
Assume that the true probability distribution is denoted by $Q$. For each probabilistic forecast $P$, we can define the expected score as
$$S(P, Q) = \int S(P, \omega) \, dQ(\omega).$$
A positively oriented scoring rule is said to be proper if for all probability distributions $P$ and $Q$, we have
$$S(Q, Q) \geq S(P, Q).$$
In other words, for a proper score, the forecaster maximizes the expected reward by forecasting the true distribution. A strictly proper score is a score such that equality above is achieved uniquely at $P = Q$. To maximize a strictly proper score, a forecaster has every incentive to give an “honest” forecast and has no incentive to “hedge”.
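To make the definition concrete, here is a minimal sketch for a binary outcome, assuming the forecast is a single number p (the forecast probability that the outcome is 1) and the true probability is q. The function names and the grid search are illustrative choices, not part of any library. The sketch compares two strictly proper scores (Brier and logarithmic, both positively oriented) with a linear score that simply rewards the probability assigned to the realized outcome, which is not proper.

```python
import math

def brier_score(p, outcome):
    """Positively oriented Brier score for a binary forecast p = P(outcome = 1)."""
    return -((p - outcome) ** 2)

def log_score(p, outcome):
    """Positively oriented logarithmic score for a binary forecast."""
    return math.log(p if outcome == 1 else 1.0 - p)

def linear_score(p, outcome):
    """Linear score: the probability assigned to the realized outcome."""
    return p if outcome == 1 else 1.0 - p

def expected_score(score, p, q):
    """Expected score of forecast p when the outcome is 1 with true probability q."""
    return q * score(p, 1) + (1.0 - q) * score(p, 0)

def best_forecast(score, q, grid_size=999):
    """Grid search over (0, 1) for the forecast that maximizes the expected score."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return max(grid, key=lambda p: expected_score(score, p, q))

q = 0.7  # assumed true probability, chosen only for illustration
for name, score in [("Brier", brier_score), ("log", log_score), ("linear", linear_score)]:
    print(name, best_forecast(score, q))
```

Under the Brier and logarithmic scores the best forecast on the grid is the true probability itself, while the linear score pushes the forecaster to the most extreme forecast available on the grid.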
Why would you ever use an improper scoring rule?
Since score propriety (i.e. a score being proper) seems like such a basic requirement, why would anyone use an improper scoring rule? It turns out that there are some scores that have nice properties but are not proper.
For example, imagine that we want to compare scores against some baseline forecaster. Given some score $S$, we can convert it to a skill score through normalization:
$$SS(P, \omega) = \frac{S(P_{\text{base}}, \omega) - S(P, \omega)}{S(P_{\text{base}}, \omega)},$$
where $P_{\text{base}}$ is the probabilistic forecast which the baseline makes. If $S$ is negatively oriented (smaller is better) and if the scores it produces are always non-negative, then the associated skill score is positively oriented and in the range $(-\infty, 1]$.
Skill scores seem reasonable as they give us a fixed upper bound to aspire to. However, in general skill scores are improper. (See Reference 2 for a proof, and Section 2.3 of Reference 1 for another normalization scheme.)
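As a rough numerical illustration (not the proof in Reference 2), the sketch below assumes a binary outcome, the negatively oriented Brier score, and a skill score normalized outcome by outcome as 1 - S(P, ω) / S(P_base, ω). The baseline forecast of 0.2 and the true probability of 0.5 are made-up values for the example.

```python
def brier_loss(p, outcome):
    """Negatively oriented Brier score for a binary forecast p = P(outcome = 1)."""
    return (p - outcome) ** 2

def skill_score(p, p_base, outcome):
    """Skill score 1 - S(P, w) / S(P_base, w); needs 0 < p_base < 1 to avoid dividing by zero."""
    return 1.0 - brier_loss(p, outcome) / brier_loss(p_base, outcome)

def expected_skill(p, p_base, q):
    """Expected skill score when the outcome is 1 with true probability q."""
    return q * skill_score(p, p_base, 1) + (1.0 - q) * skill_score(p, p_base, 0)

def best_forecast(p_base, q, grid_size=9999):
    """Grid search over (0, 1) for the forecast that maximizes the expected skill score."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return max(grid, key=lambda p: expected_skill(p, p_base, q))

q, p_base = 0.5, 0.2  # illustrative values only
p_star = best_forecast(p_base, q)
print("best forecast:", p_star)
print("expected skill at best forecast:", expected_skill(p_star, p_base, q))
print("expected skill at honest forecast:", expected_skill(q, p_base, q))
```

In this setup the forecast that maximizes the expected skill score sits far from the true probability, so an honest forecaster is penalized relative to one who hedges toward the baseline.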
Two other examples of improper scores that have been used are the naive linear score and mean squared error (MSE) (see Reference 3 for details). Note that mean squared error here is not the usual MSE we use for point forecasts: there is a more general definition for probabilistic forecasts.
Some examples of proper scoring rules
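Reference 1 contains a number of examples of proper scoring rules. First, assume that the sample space for the response is categorical (without loss of generality, let it be $\{1, \dots, m\}$), and let the probabilistic forecast be represented by $p = (p_1, \dots, p_m)$. Here are 4 proper scoring rules for categorical variables, written in positively oriented form (only the first three are strictly proper):
- The Brier score: $S(p, i) = -\sum_{j=1}^m (\delta_{ij} - p_j)^2$, where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise. (I wrote a previous post on the Brier score here.)
- The spherical score: For some parameter $\alpha > 1$, the (generalized) spherical score is defined as $S(p, i) = \dfrac{p_i^{\alpha - 1}}{\left( \sum_{j=1}^m p_j^\alpha \right)^{(\alpha - 1)/\alpha}}$. The traditional spherical score is the special case $\alpha = 2$.
- The logarithmic score: $S(p, i) = \log p_i$.
- The zero-one score: Let $M(p) = \{ i : p_i = \max_j p_j \}$ be the set of modes of the probabilistic forecast. Then the zero-one score is defined as $S(p, i) = 1$ if $i \in M(p)$ and $S(p, i) = 0$ otherwise.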
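The four categorical scores can be sketched in a few lines of plain Python. This assumes the positively oriented forms given in Reference 1, a forecast stored as a list of probabilities, and 0-based outcome indices (so outcome i in the code corresponds to category i + 1 in the text). The function names and the example distributions are illustrative.

```python
import math

def brier_score(p, i):
    """Minus the squared distance between the forecast p and the indicator of outcome i."""
    return -sum(((1.0 if j == i else 0.0) - pj) ** 2 for j, pj in enumerate(p))

def spherical_score(p, i, alpha=2.0):
    """Generalized spherical score; alpha = 2 gives the traditional spherical score."""
    norm = sum(pj ** alpha for pj in p) ** ((alpha - 1.0) / alpha)
    return p[i] ** (alpha - 1.0) / norm

def log_score(p, i):
    """Log of the probability assigned to the realized outcome (requires p[i] > 0)."""
    return math.log(p[i])

def zero_one_score(p, i):
    """1 if the realized outcome is a mode of the forecast, 0 otherwise."""
    return 1.0 if p[i] == max(p) else 0.0

def expected_score(score, p, q):
    """Expected score of forecast p when outcomes are drawn from q."""
    return sum(qi * score(p, i) for i, qi in enumerate(q))

q = [0.5, 0.3, 0.2]         # assumed true distribution
hedged = [0.6, 0.25, 0.15]  # a forecast that differs from q but has the same mode

for score in (brier_score, spherical_score, log_score, zero_one_score):
    print(score.__name__, expected_score(score, q, q), expected_score(score, hedged, q))
```

For the first three scores the honest forecast has a strictly higher expected score than the hedged one. For the zero-one score the two tie, because only the location of the mode matters, which is why that score is proper but not strictly proper.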
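There are similar examples for continuous responses. For simplicity, assume that the probabilistic forecast has a density $p$ w.r.t. Lebesgue measure. (See Reference 1 for a more mathematically rigorous description.) Define
$$\|p\|_\alpha = \left( \int p(\omega)^\alpha \, d\omega \right)^{1/\alpha}.$$
- The quadratic score: $\text{QS}(p, \omega) = 2 p(\omega) - \|p\|_2^2$.
- The pseudospherical score: For $\alpha > 1$, $\text{PseudoS}(p, \omega) = p(\omega)^{\alpha - 1} / \|p\|_\alpha^{\alpha - 1}$.
- The logarithmic score: $\text{LogS}(p, \omega) = \log p(\omega)$. This can be viewed as the limiting case of the pseudospherical score as $\alpha \to 1$. It is widely used, and is also known as the predictive deviance or ignorance score.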
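The density-based scores can be approximated numerically. The sketch below assumes a standard normal forecast density and a simple midpoint rule on a finite interval wide enough to hold essentially all of the probability mass. The helper names and the integration bounds are illustrative assumptions.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def alpha_norm(density, alpha, lo, hi, steps=20000):
    """Midpoint-rule approximation of (integral of density^alpha over [lo, hi])^(1/alpha)."""
    h = (hi - lo) / steps
    total = sum(density(lo + (k + 0.5) * h) ** alpha for k in range(steps))
    return (total * h) ** (1.0 / alpha)

def quadratic_score(density, x, lo, hi):
    """Quadratic score: 2 p(x) minus the squared L2 norm of the density."""
    return 2.0 * density(x) - alpha_norm(density, 2.0, lo, hi) ** 2

def pseudospherical_score(density, x, alpha, lo, hi):
    """Pseudospherical score with parameter alpha > 1."""
    return density(x) ** (alpha - 1.0) / alpha_norm(density, alpha, lo, hi) ** (alpha - 1.0)

def log_score(density, x):
    """Logarithmic score: log density at the realized outcome."""
    return math.log(density(x))

x_obs = 0.0  # illustrative realized outcome
print(quadratic_score(normal_pdf, x_obs, -10.0, 10.0))
print(pseudospherical_score(normal_pdf, x_obs, 2.0, -10.0, 10.0))
print(log_score(normal_pdf, x_obs))
```

For densities with heavier tails or a different location, the integration interval would need to be widened or shifted accordingly.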
There are options for proper scoring rules if the probabilistic forecast does not have a density. Every probabilistic forecast will have a cumulative distribution function $F$, and we can construct scores from that. The continuous ranked probability score (CRPS) is one such score, defined as
$$\text{CRPS}(F, x) = -\int_{-\infty}^{\infty} \left( F(y) - 1\{y \geq x\} \right)^2 \, dy.$$
The CRPS corresponds to the integral of the Brier scores for the associated binary probability forecasts at all real-valued thresholds.
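The threshold interpretation can be checked numerically. The sketch below assumes a normal forecast and the positively oriented convention (the CRPS is minus the integral, so larger is better). It integrates the binary Brier score over thresholds with a midpoint rule and compares the result with the known closed-form expression for a normal forecast. The function names and the integration bounds are illustrative.

```python
import math

def normal_cdf(y, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))

def crps_numeric(cdf, x, lo, hi, steps=200000):
    """Minus the integral over thresholds y of the Brier score (F(y) - 1{y >= x})^2."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        y = lo + (k + 0.5) * h
        indicator = 1.0 if y >= x else 0.0  # did the event "outcome <= y" occur?
        total += (cdf(y) - indicator) ** 2
    return -total * h

def crps_normal_closed_form(x, mu=0.0, sigma=1.0):
    """Closed-form CRPS for a normal forecast, positively oriented."""
    z = (x - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = normal_cdf(z)
    return -sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

x_obs = 0.3  # illustrative realized outcome
print(crps_numeric(normal_cdf, x_obs, -12.0, 12.0))
print(crps_normal_closed_form(x_obs))
```

The two values should agree up to the discretization error of the midpoint rule, which is dominated by the jump of the indicator at the realized outcome.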
References
- Gneiting, T., and Raftery, A. E. (2007). Strictly proper scoring rules, prediction and estimation.
- Murphy, A. H. (1973). Hedging and skill scores for probability forecasts.
- Bröcker, J., and Smith, L. A. (2007). Scoring probabilistic forecasts: The importance of being proper.
What do you mean by “true probability distribution”? I understand that if I was tossing a coin that one could think about a true probability distribution, but with weather forecasts, how would you define that?
I think one can end up getting pretty philosophical with this, depending on how far down the rabbit hole one wants to go.
One way you can think about the “true probability distribution” is via the frequentist perspective: it is the “long-run frequency of repeatable experiments”. For tossing a coin, one can imagine tossing the same coin over and over again, and assuming that each coin toss is indistinguishable from the others. For weather forecasts, e.g. “the distribution of weather outcomes for tomorrow”, I like to use a thought experiment where I imagine several parallel universes. The probability of being sunny tomorrow is just the proportion of universes where it is sunny tomorrow.
It’s complicated because “true probability” only exists in the context of a model (which means it’s not really “true” at all). So the only way to make the thought experiment work would be to fix a particular weather forecast model (which would determine the sizes of the differences between your different universes). But I looked in the literature and there is a paper (https://doi.org/10.1175/WAF-D-19-0205.1) which manages to define “proper” without using a “true probability” (see the paragraph after equation 3). That seems like a nice approach to me.