5.9 Evaluating distributional forecast accuracy

The preceding measures all measure point forecast accuracy. When evaluating distributional forecasts, we need to use some other measures.

Quantile scores

Consider the Google stock price example from the previous section. Figure 5.17 shows an 80% prediction interval for the forecasts from the naïve model.

google_fc %>%
  filter(.model == "Naïve") %>%
  autoplot(bind_rows(google_2015, google_jan_2016), level = 80)
Naive forecasts of the Google stock price for Jan 2016, along with 80% prediction intervals.

Figure 5.17: Naive forecasts of the Google stock price for Jan 2016, along with 80% prediction intervals.

The lower limit of this prediction interval gives the 10th percentile (or 0.1 quantile) of the forecast distribution, so we would expect the actual value to lie below the lower limit about 10% of the time, and to lie above the lower limit about 90% of the time. When we compare the actual value to this percentile, we need to allow for the fact that it is more likely to be above than below.

More generally, suppose we are interested in the quantile forecast with probability \(p\) at future time \(t\), and let this be denoted by \(f_{p,t}\). That is, we expect the observation at time \(t\) to be less than \(f_{p,t}\) with probability \(p\). For example, the 10th percentile would be \(f_{0.10,t}\). If \(y_{t}\) denotes the observation at time \(t\), then the Quantile Score is \[ Q_{p,t} = \begin{cases} 2(1 - p) \big(f_{p,t} - y_{t}\big), & \text{if $y_{t} < f_{p,t}$}\\ 2p \big(y_{t} - f_{p,t}\big), & \text{if $y_{t} \ge f_{p,t}$} \end{cases} \] This is sometimes called the “pinball loss function” because a graph of it resembles the trajectory of a ball on a pinball table. The multiplier of 2 is often omitted, but including it makes the interpretation a little easier. A low value of \(Q_{p,t}\) indicates a better estimate of the quantile.

The quantile score can be interpreted like an absolute error. In fact, when \(p=0.5\), the quantile score \(Q_{0.5,t}\) is the same as the absolute error. For other values of \(p\), the “error” \((y_t - f_{p,t})\) is weighted to take account of how likely it is be positive or negative. If \(p>0.5\), \(Q_{p,t}\) gives a heavier penalty when the observation is greater than the estimated quantile than when the observation is less than the estimated quantile. The reverse is true for \(p<0.5\).

In Figure 5.17, the one-step-ahead 10% quantile forecast (for 4 January 2016) is \(f_{0.1,t} = 745\) and the observed value is \(y_t = 742\). Then \[ Q_{0.1,t} = 2(1-0.1) (744.54 - 741.84) = 4.86. \] This is easily computed using accuracy with the quantile_score function:

google_fc %>%
  filter(.model == "Naïve", Date == "2016-01-04") %>%
  accuracy(google_stock, list(qs = quantile_score), probs = 0.10)
#> # A tibble: 1 x 4
#>   .model Symbol .type    qs
#>   <chr>  <chr>  <chr> <dbl>
#> 1 Naïve  GOOG   Test   4.86

Winkler Score

It is often of interest to evaluate a prediction interval, rather than a few quantiles, and the Winkler score proposed by Winkler (1972) is designed for this purpose. If the \(100(1-\alpha)\)% prediction interval at time \(t\) is given by \([\ell_{\alpha,t}, u_{\alpha,t}]\), then the Winkler score is defined as the length of the interval plus a penalty if the observation is outside the interval: \[ W_{\alpha,t} = \begin{cases} (u_{\alpha,t} - \ell_{\alpha,t}) + \frac{2}{\alpha} (\ell_{\alpha,t} - y_t) & \text{if } y_t < \ell_{\alpha,t} \\ (u_{\alpha,t} - \ell_{\alpha,t}) & \text{if } \ell_{\alpha,t} \le y_t \le u_{\alpha,t} \\ (u_{\alpha,t} - \ell_{\alpha,t}) + \frac{2}{\alpha} (y_t - u_{\alpha,t}) & \text{if } y_t > u_{\alpha,t}. \end{cases} \] For observations that fall within the interval, the Winkler score is simply the length of the interval. So low scores are associated with narrow intervals. However, if the observation falls outside the interval, the penalty applies, with the penalty proportional to how far the observation is outside the interval.

Prediction intervals are usually constructed from quantiles by setting \(\ell_{\alpha,t} = f_{\alpha/2,t}\) and \(u_{\alpha,t} = f_{1-\alpha/2,t}\). If we add the corresponding quantile scores and divide by \(\alpha\), we get the Winkler score: \[ W_{\alpha,t} = (Q_{\alpha/2,t} + Q_{1-\alpha/2,t})/\alpha. \]

The one-step-ahead 80% interval shown in Figure 5.17 for 4 January 2016 is [744.54, 773.22], and the actual value was 741.84, so the Winkler score is \[ W_{\alpha,t} = (773.22 - 744.54) + \frac{2}{0.2} (744.54 - 741.84) = 55.68. \] This is easily computed using accuracy with the winkler_score function:

google_fc %>%
  filter(.model == "Naïve", Date == "2016-01-04") %>%
  accuracy(google_stock, list(winkler = winkler_score), level = 80)
#> # A tibble: 1 x 4
#>   .model Symbol .type winkler
#>   <chr>  <chr>  <chr>   <dbl>
#> 1 Naïve  GOOG   Test     55.7

Continuous Ranked Probability Score

Often we are interested in the whole forecasting distribution, rather than particular quantiles or prediction intervals. In that case, we can average the quantile scores over all values of \(p\) to obtain the Continuous Ranked Probability Score or CRPS (Gneiting & Katzfuss, 2014).

In the Google stock price example, we can compute the average CRPS value for all days in the test set. A CRPS value is a little like a weighted absolute error computed from the entire forecast distribution, where the weighting takes account of the probabilities.

google_fc %>%
  accuracy(google_stock, list(crps = CRPS))
#> # A tibble: 3 x 4
#>   .model Symbol .type  crps
#>   <chr>  <chr>  <chr> <dbl>
#> 1 Drift  GOOG   Test   33.5
#> 2 Mean   GOOG   Test   76.7
#> 3 Naïve  GOOG   Test   26.5

Here, the Naïve method is giving better distributional forecasts than the Drift or Mean methods.

Scale-free comparisons using skill scores

As with point forecasts, it is useful to be able to compare the distributional forecast accuracy of several methods across series on different scales. For point forecasts, we used scaled errors for that purpose. Another approach is to use skill scores. These can be used for both point forecast accuracy and distributional forecast accuracy.

With skill scores, we compute a forecast accuracy measure relative to some benchmark method. For example, if we use the Naïve method as a benchmark, and also compute forecasts using the Drift method, we can compute the CRPS skill score of the Drift method relative to the Naïve method as \[ \frac{\text{CRPS}_{N} - \text{CRPS}_{D}}{\text{CRPS}_{N}} \] This gives the proportion that the Drift method improves over the Naïve method based on CRPS. It is easy to compute using the accuracy() function.

google_fc %>%
  accuracy(
    google_stock %>% filter(year(Date) >= 2015),
    list(skill = skill_score(CRPS))
  )
#> # A tibble: 3 x 4
#>   .model Symbol .type  skill
#>   <chr>  <chr>  <chr>  <dbl>
#> 1 Drift  GOOG   Test  -0.266
#> 2 Mean   GOOG   Test  -1.90 
#> 3 Naïve  GOOG   Test   0

Of course, the skill score for Naïve is 0 because it can’t improve on itself. The other two methods have larger CRPS values than Naïve, so the skills scores are negative; the Drift method is 26.6% worse than the Naïve method.

When the data are seasonal, skill_score() will use a Seasonal Naïve benchmark method rather than the Naïve benchmark. This will work even when the benchmark forecast is not included in the fable object as skill_score() computes the benchmark forecasts that are needed. It is important that the data provided to accuracy() include only the training and test data to ensure the same training data are used for the benchmark forecasts. If we had not filtered the google_stock data to remove data prior to 2015, the benchmark CRPS would have used all available data, and then the skill value for Naïve would not have been 0 as the CRPS values would have been computed on two different training sets.

The skill_score() function can be used with any accuracy measure. For example, skill_score(MSE) provides a way of comparing MSE values across diverse series. However, it is important that the test set is large enough to allow reliable calculation of the error measure, especially in the denominator. For that reason, MASE or RMSSE are often preferable scale-free measures for point forecast accuracy.