8.6 Estimation and model selection

Estimating ETS models

An alternative to estimating the parameters by minimising the sum of squared errors is to maximise the “likelihood”. The likelihood is the probability of the data arising from the specified model. Thus, a large likelihood is associated with a good model. For an additive error model, maximising the likelihood (assuming normally distributed errors) gives the same results as minimising the sum of squared errors. However, different results will be obtained for multiplicative error models. In this section, we will estimate the smoothing parameters \(\alpha\), \(\beta\), \(\gamma\) and \(\phi\), and the initial states \(\ell_0\), \(b_0\), \(s_0,s_{-1},\dots,s_{-m+1}\), by maximising the likelihood.
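The relationship between the two estimation criteria can be checked numerically. The following is a minimal Python sketch, not the code used by any forecasting package. It assumes the simplest case with no trend or seasonality, fixes the initial level rather than estimating it, and uses the Gaussian likelihood with the variance concentrated out, so that \(-2\log(L)\) equals \(T\log\big(\sum\varepsilon_t^2\big) + 2\sum\log|r_t|\) up to a constant, where \(r_t=1\) for additive errors and \(r_t=\hat{y}_{t|t-1}\) for multiplicative errors. The data values and function names are hypothetical.

```python
import math


def one_step_forecasts(y, alpha, level0):
    """One-step forecasts from simple exponential smoothing.

    The level recursion l_t = l_{t-1} + alpha * (y_t - l_{t-1}) is shared by
    ETS(A,N,N) and ETS(M,N,N); only the error definition differs.
    """
    level, forecasts = level0, []
    for obs in y:
        forecasts.append(level)
        level = level + alpha * (obs - level)
    return forecasts


def sse(y, alpha, level0):
    """Sum of squared one-step errors."""
    fc = one_step_forecasts(y, alpha, level0)
    return sum((obs - f) ** 2 for obs, f in zip(y, fc))


def neg2_loglik_additive(y, alpha, level0):
    """-2 log-likelihood (up to a constant) for additive errors."""
    return len(y) * math.log(sse(y, alpha, level0))


def neg2_loglik_multiplicative(y, alpha, level0):
    """-2 log-likelihood (up to a constant) for multiplicative errors."""
    fc = one_step_forecasts(y, alpha, level0)
    rel_sq = sum(((obs - f) / f) ** 2 for obs, f in zip(y, fc))
    return len(y) * math.log(rel_sq) + 2 * sum(math.log(abs(f)) for f in fc)


# Hypothetical toy series and a fixed initial level (an assumption).
y = [10.0, 12.0, 11.0, 13.0, 12.5, 14.0, 13.0, 15.0]
level0 = 11.0
grid = [i / 100 for i in range(1, 100)]

best_by_sse = min(grid, key=lambda a: sse(y, a, level0))
best_by_additive_lik = min(grid, key=lambda a: neg2_loglik_additive(y, a, level0))
best_by_multiplicative_lik = min(
    grid, key=lambda a: neg2_loglik_multiplicative(y, a, level0)
)
```

Because the additive criterion is a monotonic function of the sum of squared errors, `best_by_sse` and `best_by_additive_lik` coincide. The multiplicative criterion weights errors relative to the forecast, so `best_by_multiplicative_lik` need not match them.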

The possible values that the smoothing parameters can take are restricted. Traditionally, the parameters have been constrained to lie between 0 and 1 so that the equations can be interpreted as weighted averages. That is, \(0< \alpha,\beta^*,\gamma^*,\phi<1\). For the state space models, we have set \(\beta=\alpha\beta^*\) and \(\gamma=(1-\alpha)\gamma^*\). Therefore, the traditional restrictions translate to \(0< \alpha <1\), \(0 < \beta < \alpha\) and \(0< \gamma < 1-\alpha\). In practice, the damping parameter \(\phi\) is usually constrained further to prevent numerical difficulties in estimating the model. In R, it is restricted so that \(0.8<\phi<0.98\).
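The translation between the two parameterisations can be sketched in a few lines of Python. This is only an illustration of the inequalities above; the function names are hypothetical, and the bounds on \(\phi\) are simply the \(0.8<\phi<0.98\) range quoted in the text.

```python
def to_state_space(alpha, beta_star, gamma_star):
    """Convert (alpha, beta*, gamma*) to the state space (beta, gamma)."""
    beta = alpha * beta_star
    gamma = (1 - alpha) * gamma_star
    return beta, gamma


def in_traditional_region(alpha, beta=None, gamma=None, phi=None):
    """Check the traditional restrictions in state space form.

    Parameters left as None (for example, beta in a model with no trend)
    are skipped.
    """
    ok = 0 < alpha < 1
    if beta is not None:
        ok = ok and 0 < beta < alpha
    if gamma is not None:
        ok = ok and 0 < gamma < 1 - alpha
    if phi is not None:
        ok = ok and 0.8 < phi < 0.98  # narrower range quoted in the text
    return ok


# Any starred parameters in (0, 1) map into the traditional region.
beta, gamma = to_state_space(alpha=0.5, beta_star=0.4, gamma_star=0.6)
```

With these inputs, `beta` is approximately 0.2 and `gamma` approximately 0.3, both inside the traditional region for \(\alpha=0.5\).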

Another way to view the parameters is through a consideration of the mathematical properties of the state space models. The parameters are constrained in order to prevent observations in the distant past having a continuing effect on current forecasts. This leads to some admissibility constraints on the parameters, which are usually (but not always) less restrictive than the traditional constraint region (Hyndman et al., 2008, Ch10). For example, for the ETS(A,N,N) model, the traditional parameter region is \(0< \alpha <1\) but the admissible region is \(0< \alpha <2\). For the ETS(A,A,N) model, the traditional parameter region is \(0<\alpha<1\) and \(0<\beta<\alpha\) but the admissible region is \(0<\alpha<2\) and \(0<\beta<4-2\alpha\).
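To see why \(0<\alpha<2\) is enough for the ETS(A,N,N) model, recall that simple exponential smoothing places weight \(\alpha(1-\alpha)^j\) on the observation \(j\) periods back. The Python sketch below (hypothetical function names, for illustration only) encodes the two admissible regions quoted above and computes these weights, which shrink in magnitude exactly when \(|1-\alpha|<1\).

```python
def admissible_ann(alpha):
    """Admissible region for ETS(A,N,N)."""
    return 0 < alpha < 2


def admissible_aan(alpha, beta):
    """Admissible region for ETS(A,A,N)."""
    return 0 < alpha < 2 and 0 < beta < 4 - 2 * alpha


def ses_weights(alpha, n):
    """Weights alpha * (1 - alpha)^j on y_{t-j}, for j = 0, ..., n-1."""
    return [alpha * (1 - alpha) ** j for j in range(n)]


# alpha = 1.5 is outside the traditional region but admissible:
# the weights alternate in sign yet decay in magnitude.
decaying = [abs(w) for w in ses_weights(1.5, 6)]

# alpha = 2.5 is not admissible: old observations get ever larger weights.
exploding = [abs(w) for w in ses_weights(2.5, 6)]
```

Here `decaying` falls by half at each lag, while `exploding` grows by a factor of 1.5, so distant observations would dominate the forecast.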

Model selection

A great advantage of the ETS statistical framework is that information criteria can be used for model selection. The AIC, AIC\(_{\text{c}}\) and BIC, introduced in Section 6.5, can be used here to determine which of the ETS models is most appropriate for a given time series.

For ETS models, Akaike’s Information Criterion (AIC) is defined as \[ \text{AIC} = -2\log(L) + 2k, \] where \(L\) is the likelihood of the model and \(k\) is the total number of parameters and initial states that have been estimated (including the residual variance).

The AIC corrected for small sample bias (AIC\(_\text{c}\)) is defined as \[ \text{AIC}_{\text{c}} = \text{AIC} + \frac{2k(k+1)}{T-k-1}, \] and the Bayesian Information Criterion (BIC) is \[ \text{BIC} = \text{AIC} + k[\log(T)-2]. \]
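As a quick check on these definitions, the three criteria can be computed from a log-likelihood, the parameter count \(k\) and the number of observations \(T\). This is a minimal Python sketch with hypothetical function names; it uses the usual small-sample correction \(2k(k+1)/(T-k-1)\) and takes the log-likelihood as given rather than estimating a model.

```python
import math


def aic(loglik, k):
    """Akaike's Information Criterion: -2 log(L) + 2k."""
    return -2 * loglik + 2 * k


def aicc(loglik, k, T):
    """AIC corrected for small-sample bias."""
    if T - k - 1 <= 0:
        raise ValueError("AICc is undefined unless T > k + 1")
    return aic(loglik, k) + 2 * k * (k + 1) / (T - k - 1)


def bic(loglik, k, T):
    """Bayesian Information Criterion, written relative to the AIC."""
    return aic(loglik, k) + k * (math.log(T) - 2)


# Hypothetical values: log-likelihood of -100, k = 3, T = 20.
example_aic = aic(-100.0, 3)        # 206.0
example_aicc = aicc(-100.0, 3, 20)  # 206.0 + 24/16 = 207.5
example_bic = bic(-100.0, 3, 20)
```

Written this way, the BIC is the same as \(-2\log(L) + k\log(T)\), so it penalises extra parameters more heavily than the AIC whenever \(\log(T) > 2\), that is, for \(T \ge 8\).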

Three of the combinations of (Error, Trend, Seasonal) can lead to numerical difficulties. Specifically, the models that can cause such instabilities are ETS(A,N,M), ETS(A,A,M), and ETS(A,A\(_d\),M), due to division by values potentially close to zero in the state equations. We normally do not consider these particular combinations when selecting a model.

Models with multiplicative errors are useful when the data are strictly positive, but are not numerically stable when the data contain zeros or negative values. Therefore, multiplicative error models will not be considered if the time series is not strictly positive. In that case, only the six fully additive models will be applied.
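The two exclusion rules can be written as a simple filter over the candidate set. The Python sketch below is an illustration only, not the selection code used by ETS(). It assumes a taxonomy of two error types (A, M), three trend types (N, A, Ad) and three seasonal types (N, A, M), giving 18 combinations before any are dropped; the names are hypothetical.

```python
from itertools import product

ERRORS = ("A", "M")
TRENDS = ("N", "A", "Ad")
SEASONS = ("N", "A", "M")

# Combinations that can cause numerical difficulties.
UNSTABLE = {("A", "N", "M"), ("A", "A", "M"), ("A", "Ad", "M")}


def candidate_models(y):
    """Return the (Error, Trend, Seasonal) combinations to consider for y."""
    strictly_positive = all(value > 0 for value in y)
    models = []
    for error, trend, season in product(ERRORS, TRENDS, SEASONS):
        if (error, trend, season) in UNSTABLE:
            continue
        # Multiplicative errors are only used for strictly positive data.
        if error == "M" and not strictly_positive:
            continue
        models.append((error, trend, season))
    return models


positive_case = candidate_models([5.0, 7.0, 6.0, 8.0])  # hypothetical data
with_zero_case = candidate_models([5.0, 0.0, 6.0, 8.0])
```

For strictly positive data this leaves 15 candidates. When the series contains a zero, dropping the multiplicative error models on top of the three unstable ones leaves exactly the six fully additive models.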

Example: Domestic holiday tourist visitor nights in Australia

We now employ the ETS statistical framework to forecast Australian holiday tourism over the period 2016–2019. We let the ETS() function select the model by minimising the AICc.

The model selected is ETS(M,N,M): \[\begin{align*} y_{t} &= \ell_{t-1}s_{t-m}(1 + \varepsilon_t)\\ \ell_t &= \ell_{t-1}(1 + \alpha \varepsilon_t)\\ s_t &= s_{t-m}(1+ \gamma \varepsilon_t). \end{align*}\]
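These three equations can be applied recursively once the parameters and initial states are known. The following Python sketch shows one way to do so; it is not the ETS() implementation, the function name is hypothetical, and the parameters are taken as given rather than estimated.

```python
def ets_mnm_filter(y, alpha, gamma, level0, seasonal0):
    """Run the ETS(M,N,M) recursions over the series y.

    seasonal0 holds the m initial seasonal states ordered from oldest to
    newest, that is s_{-m+1}, ..., s_0.
    Returns the one-step forecasts, the relative errors, and the final states.
    """
    level = level0
    seasonals = list(seasonal0)  # seasonals[0] is always s_{t-m}
    fitted, errors = [], []
    for obs in y:
        s_lag = seasonals.pop(0)
        y_hat = level * s_lag                       # one-step forecast
        eps = obs / y_hat - 1                       # from y_t = l s (1 + eps)
        fitted.append(y_hat)
        errors.append(eps)
        level = level * (1 + alpha * eps)           # level equation
        seasonals.append(s_lag * (1 + gamma * eps))  # seasonal equation
    return fitted, errors, level, seasonals


# Hypothetical quarterly example: data that follow the seasonal pattern
# exactly give errors of (numerically) zero and leave the states unchanged.
fitted, errors, level, seasonals = ets_mnm_filter(
    y=[80.0, 120.0, 110.0, 90.0],
    alpha=0.3,
    gamma=0.1,
    level0=100.0,
    seasonal0=[0.8, 1.2, 1.1, 0.9],
)
```

Every update is proportional to the current state, so a value of \(\gamma\) close to zero leaves the seasonal pattern almost fixed at its initial values.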

The parameter estimates are \(\hat\alpha=0.3578\) and \(\hat\gamma=0.000969\). The output also returns the estimates for the initial states \(\ell_0\), \(s_{0}\), \(s_{-1}\), \(s_{-2}\) and \(s_{-3}\). Compare these with the values obtained for the equivalent Holt-Winters’ method with multiplicative seasonality presented in Table 8.4. The ETS(M,N,M) model will give different point forecasts from the multiplicative Holt-Winters’ method, because the parameters have been estimated differently. With the ETS() function, the default estimation method is maximum likelihood rather than minimum sum of squares.

Figure 8.9 shows the states over time, while Figure 8.11 shows point forecasts and prediction intervals generated from the model. The small value of \(\gamma\) indicates that the seasonal components change very little over time. The narrow prediction intervals indicate that the series is relatively easy to forecast due to the strong trend and seasonality.

Figure 8.9: Graphical representation of the estimated states over time.

Because this model has multiplicative errors, the residuals are not equivalent to the one-step training errors. The residuals are given by \(\hat{\varepsilon}_t\), while the one-step training errors are defined as \(y_t - \hat{y}_{t|t-1}\). We can obtain both using the residuals() function.
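The distinction is easy to see in code. Rearranging \(y_t = \hat{y}_{t|t-1}(1+\varepsilon_t)\) gives \(\hat{\varepsilon}_t = (y_t - \hat{y}_{t|t-1})/\hat{y}_{t|t-1}\), so the one-step training error is the residual scaled by the forecast. The Python sketch below is an illustration under that assumption; the function and variable names are hypothetical and the numbers are made up.

```python
def innovation_and_response(y, fitted):
    """Split model errors into the two kinds discussed above.

    Assumes `fitted` holds the one-step forecasts from a model with
    multiplicative errors.
    """
    innovation = [(obs - f) / f for obs, f in zip(y, fitted)]  # eps-hat
    response = [obs - f for obs, f in zip(y, fitted)]          # y - y-hat
    return innovation, response


# Hypothetical observations and one-step forecasts.
innovation, response = innovation_and_response(
    y=[110.0, 95.0], fitted=[100.0, 100.0]
)
```

Here the residuals are relative errors of about 0.10 and -0.05, whereas the one-step training errors are 10 and -5 on the scale of the data.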

Figure 8.10: Residuals and one-step forecast errors from the ETS(M,N,M) model.

The type argument is used in the residuals() function to distinguish between residuals and forecast errors. The default is type='innovation', which gives regular residuals.

Bibliography

Hyndman, R. J., Koehler, A. B., Ord, J. K., & Snyder, R. D. (2008). Forecasting with exponential smoothing: The state space approach. Berlin: Springer-Verlag. http://www.exponentialsmoothing.net