2/5 Evaluating forecast accuracy

Fore­cast accu­racy measures

Let y_{i} denote the ith obser­va­tion and \hat{y}_{i} denote a fore­cast of y_{i}.

Scale-dependent errors

The fore­cast error is sim­ply e_{i}=y_{i}-\hat{y}_{i}, which is on the same scale as the data. Accu­racy mea­sures that are based on e_{i} are there­fore scale-dependent and can­not be used to make com­par­isons between series that are on dif­fer­ent scales.

The two most com­monly used scale-dependent mea­sures are based on the absolute errors or squared errors:

    \begin{align*} \text{\hbox to 6cm{Mean absolute error:\hfil MAE}} & = \text{mean}(|e_{i}|),\\[0.3cm] \text{\hbox to 6cm{Root mean squared error:\hfil RMSE}} & = \sqrt{\text{mean}(e_{i}^2)}. \end{align*}

When com­par­ing fore­cast meth­ods on a sin­gle data set, the MAE is pop­u­lar as it is easy to under­stand and compute.
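As a minimal sketch of the two definitions above, the following base R code computes the MAE and RMSE for a small made-up set of actual values and forecasts (the numbers are purely illustrative):

```r
# Toy data: actual values and forecasts (made up for illustration).
y    <- c(10, 12, 14, 13)
yhat <- c(11, 11, 15, 10)

e <- y - yhat                 # forecast errors e_i = y_i - yhat_i

mae  <- mean(abs(e))          # mean absolute error: 1.5
rmse <- sqrt(mean(e^2))       # root mean squared error: sqrt(3)

# Changing the units of the data (here multiplying by 1000) rescales the
# measure by the same factor, which is why it is scale-dependent.
mae_rescaled <- mean(abs(1000 * y - 1000 * yhat))   # 1500
```

Because the RMSE squares the errors before averaging, the single large error (3) affects it more than it affects the MAE.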

Per­cent­age errors

The per­cent­age error is given by p_{i} = 100 e_{i}/y_{i}. Per­cent­age errors have the advan­tage of being scale-independent, and so are fre­quently used to com­pare fore­cast per­for­mance between dif­fer­ent data sets. The most com­monly used mea­sure is:

    \[ \text{Mean absolute percentage error:\qquad MAPE} = \text{mean}(|p_{i}|). \]

Mea­sures based on per­cent­age errors have the dis­ad­van­tage of being infi­nite or unde­fined if y_{i}=0 for any i in the period of inter­est, and hav­ing extreme val­ues when any y_{i} is close to zero. Another prob­lem with per­cent­age errors that is often over­looked is that they assume a mean­ing­ful zero. For exam­ple, a per­cent­age error makes no sense when mea­sur­ing the accu­racy of tem­per­a­ture fore­casts on the Fahren­heit or Cel­sius scales.
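The following sketch illustrates both problems with made-up numbers. The helper names `mape` and `to_f` are our own and are not part of any package.

```r
mape <- function(y, yhat) mean(abs(100 * (y - yhat) / y))

mape(c(100, 200, 50), c(110, 180, 55))   # every forecast is off by 10%: 10

# One zero observation makes the measure infinite.
mape(c(100, 0), c(110, 5))               # Inf

# No meaningful zero: the same temperature forecasts give different MAPEs
# in Celsius and Fahrenheit.
to_f   <- function(celsius) celsius * 9 / 5 + 32
y_c    <- c(5, 10)
yhat_c <- c(6, 9)
mape_c <- mape(y_c, yhat_c)              # 15
mape_f <- mape(to_f(y_c), to_f(yhat_c))  # about 4
```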

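They also have the disadvantage that they put a heavier penalty on negative errors than on positive errors. For positive data and non-negative forecasts, the absolute percentage error cannot exceed 100 when the forecast is too low, but it has no upper limit when the forecast is too high. This observation led to the use of the so-called “symmetric” MAPE (sMAPE) proposed by Armstrong (1985, p. 348), which was used in the M3 forecasting competition. It is defined by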

    \[ \text{sMAPE} = \text{mean}\left(200|y_{i} - \hat{y}_{i}|/(y_{i}+\hat{y}_{i})\right). \]

How­ever, if y_{i} is close to zero, \hat{y}_{i} is also likely to be close to zero. Thus, the mea­sure still involves divi­sion by a num­ber close to zero, mak­ing the cal­cu­la­tion unsta­ble. Also, the value of sMAPE can be neg­a­tive, so it is not really a mea­sure of “absolute per­cent­age errors” at all.
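A small numerical sketch of these properties follows, using made-up values and a helper `smape` written here only for illustration:

```r
smape <- function(y, yhat) {
  mean(200 * abs(y - yhat) / (y + yhat))
}

# Swapping actual and forecast leaves sMAPE unchanged, unlike the
# percentage error (33.3 for the first pair, 50 for the second).
smape(150, 100)     # 40
smape(100, 150)     # 40

# Values near zero: an absolute error of only 0.02 gives roughly 100.
smape(0.01, 0.03)

# If y + yhat is negative, the result is negative.
smape(1, -3)        # 200 * 4 / (-2) = -400
```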

Hyn­d­man and Koehler (2006) rec­om­mend that the sMAPE not be used. It is included here only because it is widely used, although we will not use it in this book.

Scaled errors

Scaled errors were pro­posed by Hyn­d­man and Koehler (2006) as an alter­na­tive to using per­cent­age errors when com­par­ing fore­cast accu­racy across series on dif­fer­ent scales. They pro­posed scal­ing the errors based on the train­ing MAE from a sim­ple fore­cast method. For a non-seasonal time series, a use­ful way to define a scaled error uses naïve forecasts:

    \[ q_{j} = \frac{\displaystyle e_{j}}{\displaystyle\frac{1}{T-1}\sum_{t=2}^T |y_{t}-y_{t-1}|}. \]

Because the numer­a­tor and denom­i­na­tor both involve val­ues on the scale of the orig­i­nal data, q_{j} is inde­pen­dent of the scale of the data. A scaled error is less than one if it arises from a bet­ter fore­cast than the aver­age naïve fore­cast com­puted on the train­ing data. Con­versely, it is greater than one if the fore­cast is worse than the aver­age naïve fore­cast com­puted on the train­ing data. For sea­sonal time series, a scaled error can be defined using sea­sonal naïve forecasts:

    \[ q_{j} = \frac{\displaystyle e_{j}}{\displaystyle\frac{1}{T-m}\sum_{t=m+1}^T |y_{t}-y_{t-m}|}. \]

For cross-sectional data, a scaled error can be defined as

    \[ q_{j} = \frac{\displaystyle e_{j}}{\displaystyle\frac{1}{N}\sum_{i=1}^N |y_i-\bar{y}|}. \]

In this case, the com­par­i­son is with the mean fore­cast. (This doesn’t work so well for time series data as there may be trends and other pat­terns in the data, mak­ing the mean a poor com­par­i­son. Hence, the naïve fore­cast is rec­om­mended when using time series data.)

The mean absolute scaled error is simply

    \[ \text{MASE} = \text{mean}(|q_{j}|). \]

Sim­i­larly, the mean squared scaled error (MSSE) can be defined where the errors (on the train­ing data and test data) are squared instead of using absolute values.
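The time series definitions above can be sketched in a few lines of base R. The function `mase` and its arguments are our own naming for this illustration, with `m = 1` giving the non-seasonal case, and the data are made up.

```r
# Scaled errors: test-set errors divided by the training-set MAE of the
# naive (m = 1) or seasonal naive (m > 1) method.
mase <- function(train, test, fc, m = 1) {
  n <- length(train)
  scale <- mean(abs(train[(m + 1):n] - train[1:(n - m)]))
  q <- (test - fc) / scale            # scaled errors q_j
  mean(abs(q))
}

train <- c(10, 12, 11, 13, 12, 14)    # made-up training data
test  <- c(13, 15)

naive_fc <- rep(tail(train, 1), length(test))   # last value repeated
mase_naive <- mase(train, test, naive_fc)       # 1 / 1.6 = 0.625

# Treating the series as seasonal with period m = 2:
snaive_fc <- tail(train, 2)                     # last full season repeated
mase_snaive <- mase(train, test, snaive_fc, m = 2)   # 1
```

A value below one, as in the first case, says that the test-set forecasts were more accurate on average than one-step naive forecasts were on the training data.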

Exam­ples

Fig­ure 2.17: Fore­casts of Aus­tralian quar­terly beer pro­duc­tion using data up to the end of 2005.

R code
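library(fpp)  # assumed installed; loads the forecast package and the ausbeer data

# Training data: 1992 Q1 to 2005 Q4
beer2 <- window(ausbeer,start=1992,end=2006-.1)

# Forecast the next 11 quarters with three simple methods
beerfit1 <- meanf(beer2,h=11)
beerfit2 <- rwf(beer2,h=11)
beerfit3 <- snaive(beer2,h=11)

# PI=FALSE hides the prediction intervals
# (older versions of the forecast package used plot.conf=FALSE)
plot(beerfit1, PI=FALSE,
  main="Forecasts for quarterly beer production")
lines(beerfit2$mean,col=2)
lines(beerfit3$mean,col=3)
lines(ausbeer)  # full series, including the test period
legend("topright", lty=1, col=c(4,2,3),
  legend=c("Mean method","Naive method","Seasonal naive method"))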

Fig­ure 2.17 shows three fore­cast meth­ods applied to the quar­terly Aus­tralian beer pro­duc­tion using data only to the end of 2005. The actual val­ues for the period 2006–2008 are also shown. We com­pute the fore­cast accu­racy mea­sures for this period.

Method                    RMSE     MAE    MAPE    MASE
Mean method              38.01   33.78    8.17    0.61
Naïve method             70.91   63.91   15.88    1.15
Seasonal naïve method    12.97   11.27    2.73    0.20

R code
# Test data: 2006 Q1 onwards
beer3 <- window(ausbeer, start=2006)
# Accuracy of each method on the test data
accuracy(beerfit1, beer3)
accuracy(beerfit2, beer3)
accuracy(beerfit3, beer3)

It is obvi­ous from the graph that the sea­sonal naïve method is best for these data, although it can still be improved, as we will dis­cover later. Some­times, dif­fer­ent accu­racy mea­sures will lead to dif­fer­ent results as to which fore­cast method is best. How­ever, in this case, all the results point to the sea­sonal naïve method as the best of these three meth­ods for this data set.

To take a non-seasonal exam­ple, con­sider the Dow Jones Index. The fol­low­ing graph shows the 250 obser­va­tions end­ing on 15 July 1994, along with fore­casts of the next 42 days obtained from three dif­fer­ent methods.

Fig­ure 2.18: Fore­casts of the Dow Jones Index from 16 July 1994.

R code
# Training data: the first 250 observations
dj2 <- window(dj, end=250)
plot(dj2, main="Dow Jones Index (daily ending 15 Jul 94)",
  ylab="", xlab="Day", xlim=c(2,290))
# Overlay 42-day forecasts from the three methods
lines(meanf(dj2,h=42)$mean, col=4)
lines(rwf(dj2,h=42)$mean, col=2)
lines(rwf(dj2,drift=TRUE,h=42)$mean, col=3)
legend("topleft", lty=1, col=c(4,2,3),
  legend=c("Mean method","Naive method","Drift method"))
# Add the full series, including the test observations
lines(dj)

 

Method            RMSE      MAE    MAPE    MASE
Mean method     148.24   142.42    3.66    8.70
Naïve method     62.03    54.44    1.40    3.32
Drift method     53.70    45.73    1.18    2.79

R code
# Test data: the observations after the first 250
dj3 <- window(dj, start=251)
# Accuracy of each method on the test data
accuracy(meanf(dj2,h=42), dj3)
accuracy(rwf(dj2,h=42), dj3)
accuracy(rwf(dj2,drift=TRUE,h=42), dj3)

Here, the best method is the drift method (regard­less of which accu­racy mea­sure is used).

Train­ing and test sets

It is impor­tant to eval­u­ate fore­cast accu­racy using gen­uine fore­casts. That is, it is invalid to look at how well a model fits the his­tor­i­cal data; the accu­racy of fore­casts can only be deter­mined by con­sid­er­ing how well a model per­forms on new data that were not used when fit­ting the model. When choos­ing mod­els, it is com­mon to use a por­tion of the avail­able data for fit­ting, and use the rest of the data for test­ing the model, as was done in the above exam­ples. Then the test­ing data can be used to mea­sure how well the model is likely to fore­cast on new data.

The size of the test set is typ­i­cally about 20% of the total sam­ple, although this value depends on how long the sam­ple is and how far ahead you want to fore­cast. The size of the test set should ide­ally be at least as large as the max­i­mum fore­cast hori­zon required.
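One possible way to make such a split in base R is sketched below. The series is made up, and the 20% proportion and the horizon `h_max` are illustrative assumptions rather than fixed rules.

```r
# A made-up quarterly series of 40 observations starting in 2000 Q1.
y <- ts(100 + cumsum(rep(c(2, -1, 3, 1), times = 10)),
        start = c(2000, 1), frequency = 4)

n      <- length(y)
n_test <- round(0.2 * n)     # hold out about 20% of the sample
h_max  <- 4                  # longest horizon of interest (hypothetical)
stopifnot(n_test >= h_max)   # test set should cover the maximum horizon

# Split by time so that the test set follows the training set.
train <- window(y, end   = time(y)[n - n_test])
test  <- window(y, start = time(y)[n - n_test + 1])
```

Here the training set runs from 2000 Q1 to 2007 Q4 (32 observations) and the test set from 2008 Q1 to 2009 Q4 (8 observations).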

The fol­low­ing points should be noted.

  • A model which fits the data well does not nec­es­sar­ily fore­cast well.
  • A per­fect fit can always be obtained by using a model with enough parameters.
  • Over-fitting a model to data is as bad as fail­ing to iden­tify the sys­tem­atic pat­tern in the data.
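These points can be illustrated with a small made-up example: a roughly linear series of eight training values, fitted first with a straight line and then with a degree-7 polynomial, which has as many parameters as there are observations. The data and the helper `score` are assumptions made only for this sketch.

```r
t_train <- 1:8
y_train <- c(5.2, 6.1, 6.4, 7.3, 7.4, 8.2, 8.4, 9.1)   # made-up, roughly linear
t_test  <- 9:10
y_test  <- c(9.6, 10.1)

rmse <- function(e) sqrt(mean(e^2))

# Fit a polynomial trend of the given degree; return training and test RMSE.
score <- function(degree) {
  dat  <- data.frame(tt = t_train, y = y_train)
  fit  <- lm(y ~ poly(tt, degree), data = dat)
  pred <- predict(fit, newdata = data.frame(tt = t_test))
  c(train = rmse(residuals(fit)), test = rmse(y_test - pred))
}

line   <- score(1)   # two parameters
wiggly <- score(7)   # eight parameters for eight points: a perfect fit
```

The degree-7 polynomial reproduces the training data essentially exactly, yet its forecasts for the two test observations are far worse than those of the straight line.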

Some ref­er­ences describe the test set as the “hold-out set” because these data are “held out” of the data used for fit­ting. Other ref­er­ences call the train­ing set the “in-sample data” and the test set the “out-of-sample data”. We pre­fer to use “train­ing set” and “test set” in this book.

Cross-validation

A more sophis­ti­cated ver­sion of training/test sets is cross-validation.

For cross-sectional data, cross-validation works as follows.

  1. Select obser­va­tion i for the test set, and use the remain­ing obser­va­tions in the train­ing set. Com­pute the error on the test observation.
  2. Repeat the above step for i=1,2,\dots,N where N is the total num­ber of observations.
  3. Com­pute the fore­cast accu­racy mea­sures based on the errors obtained.

This is a much more efficient use of the available data, as you only omit one observation at each step. However, it can be very time-consuming to carry out, because the model must be re-estimated N times.
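The three steps for cross-sectional data can be sketched as follows. We assume a simple linear regression as the model and use made-up data; any other model could take its place.

```r
# Made-up cross-sectional data: response y and one predictor x.
dat <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)
)
N <- nrow(dat)

errors <- numeric(N)
for (i in seq_len(N)) {
  fit  <- lm(y ~ x, data = dat[-i, ])          # step 1: train without obs i
  pred <- predict(fit, newdata = dat[i, ])
  errors[i] <- dat$y[i] - pred                 # error on the held-out obs
}                                              # step 2: repeated for all i

# Step 3: accuracy measures based on the N errors.
cv_mae  <- mean(abs(errors))
cv_rmse <- sqrt(mean(errors^2))
```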

For time series data, the pro­ce­dure is sim­i­lar but the train­ing set con­sists only of obser­va­tions that occurred prior to the obser­va­tion that forms the test set. Thus, no future obser­va­tions can be used in con­struct­ing the fore­cast. How­ever, it is not pos­si­ble to get a reli­able fore­cast based on a very small train­ing set, so the ear­li­est obser­va­tions are not con­sid­ered as test sets. Sup­pose k obser­va­tions are required to pro­duce a reli­able fore­cast. Then the process works as follows.

  1. Select the obser­va­tion at time k+i for the test set, and use the obser­va­tions at times 1,2,\dots,k+i-1 to esti­mate the fore­cast­ing model. Com­pute the error on the fore­cast for time k+i.
  2. Repeat the above step for i=1,2,\dots,T-k where T is the total num­ber of observations.
  3. Com­pute the fore­cast accu­racy mea­sures based on the errors obtained.

This procedure is sometimes known as a “rolling forecasting origin” because the “origin” (k+i-1) on which the forecast is based rolls forward in time.
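A minimal sketch of the one-step procedure is given below. It uses the naïve method as the forecasting model, a made-up series, and an assumed minimum training size of k = 4.

```r
y <- c(10, 12, 11, 13, 12, 14, 13, 15, 14, 16)   # made-up series
n_obs <- length(y)                     # T in the text
k <- 4                                 # minimum training size (an assumption)

errors <- numeric(n_obs - k)
for (i in seq_len(n_obs - k)) {
  train <- y[1:(k + i - 1)]            # observations up to the origin
  fc <- tail(train, 1)                 # naive one-step forecast
  errors[i] <- y[k + i] - fc           # error on the forecast for time k+i
}

errors              # -1  2 -1  2 -1  2
mean(abs(errors))   # 1.5
```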

With time series forecasting, one-step forecasts may not be as relevant as multi-step forecasts. In this case, the cross-validation procedure based on a rolling forecasting origin can be modified to allow multi-step errors to be used. Suppose we are interested in models that produce good h-step-ahead forecasts.

  1. Select the observation at time k+h+i-1 for the test set, and use the observations at times 1,2,\dots,k+i-1 to estimate the forecasting model. Compute the h-step error on the forecast for time k+h+i-1.
  2. Repeat the above step for i=1,2,\dots,T-k-h+1 where T is the total num­ber of observations.
  3. Com­pute the fore­cast accu­racy mea­sures based on the errors obtained.

When h=1, this gives the same pro­ce­dure as out­lined above.
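The multi-step version can be sketched with a small helper that accepts any forecasting function. The names `rolling_errors`, `naive_fc` and `drift_fc`, and the interface `fc_fun(train, h)` returning a vector of h forecasts, are our own choices for illustration.

```r
# Rolling-origin h-step errors for a forecast function fc_fun(train, h).
rolling_errors <- function(y, k, h, fc_fun) {
  n_origins <- length(y) - k - h + 1
  sapply(seq_len(n_origins), function(i) {
    train  <- y[1:(k + i - 1)]         # data up to the forecast origin
    target <- k + h + i - 1            # time of the test observation
    y[target] - fc_fun(train, h)[h]    # h-step forecast error
  })
}

# Naive method: repeat the last observation.
naive_fc <- function(train, h) rep(tail(train, 1), h)

# Drift method: extend the average change seen in the training data.
drift_fc <- function(train, h) {
  n <- length(train)
  train[n] + seq_len(h) * (train[n] - train[1]) / (n - 1)
}

y <- c(10, 12, 11, 13, 12, 14, 13, 15, 14, 16)   # made-up series
e1_naive <- rolling_errors(y, k = 4, h = 1, naive_fc)   # -1 2 -1 2 -1 2
e2_naive <- rolling_errors(y, k = 4, h = 2, naive_fc)   # 1 1 1 1 1
e2_drift <- rolling_errors(y, k = 4, h = 2, drift_fc)

mean(abs(e2_naive))   # 1
mean(abs(e2_drift))
```

Comparing the mean absolute errors of the two methods at h = 2 is then a cross-validated comparison of their two-step forecast accuracy on this series.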


