4.3 STL Features

The STL decomposition discussed in Chapter 3 is the basis for several more features.

A time series decomposition can be used to measure the strength of trend and seasonality in a time series. Recall that the decomposition is written as \[ y_t = T_t + S_{t} + R_t, \] where \(T_t\) is the smoothed trend component, \(S_{t}\) is the seasonal component and \(R_t\) is a remainder component. For strongly trended data, the seasonally adjusted data should have much more variation than the remainder component. Therefore Var\((R_t)\)/Var\((T_t+R_t)\) should be relatively small. But for data with little or no trend, the two variances should be approximately the same. So we define the strength of trend as: \[ F_T = \max\left(0, 1 - \frac{\text{Var}(R_t)}{\text{Var}(T_t+R_t)}\right). \] This will give a measure of the strength of the trend between 0 and 1. Because the variance of the remainder might occasionally be even larger than the variance of the seasonally adjusted data, we set the minimal possible value of \(F_T\) equal to zero.

The strength of seasonality is defined similarly, but with respect to the detrended data rather than the seasonally adjusted data: \[ F_S = \max\left(0, 1 - \frac{\text{Var}(R_t)}{\text{Var}(S_{t}+R_t)}\right). \] A series with seasonal strength \(F_S\) close to 0 exhibits almost no seasonality, while a series with strong seasonality will have \(F_S\) close to 1 because Var\((R_t)\) will be much smaller than Var\((S_{t}+R_t)\).
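The two formulas can be checked by hand on any decomposition. The following is a minimal sketch, assuming base R's stl() function and the built-in nottem series (monthly temperatures, not part of the tourism data) purely for illustration. The values it gives will not match feat_stl() exactly, since the STL settings used there may differ.

```r
# Base R only: stl() and nottem ship with the stats and datasets packages.
# Assumption: a periodic seasonal window, chosen here just for illustration.
fit <- stl(nottem, s.window = "periodic")
comp <- fit$time.series  # columns: "seasonal", "trend", "remainder"

trend     <- as.numeric(comp[, "trend"])
seasonal  <- as.numeric(comp[, "seasonal"])
remainder <- as.numeric(comp[, "remainder"])

# Strength of trend: remainder compared with the seasonally adjusted data (T + R).
F_T <- max(0, 1 - var(remainder) / var(trend + remainder))

# Strength of seasonality: remainder compared with the detrended data (S + R).
F_S <- max(0, 1 - var(remainder) / var(seasonal + remainder))

c(trend_strength = F_T, seasonal_strength = F_S)
```

For a temperature series like this one, the seasonal strength should come out well above the trend strength, because the seasonal swing dominates the variation.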

These measures can be useful, for example, when you have a large collection of time series, and you need to find the series with the most trend or the most seasonality. These and other STL-based features are computed using the feat_stl() function.

tourism %>%
  features(Trips, feat_stl)
#> # A tibble: 304 x 12
#>    Region   State    Purpose trend_strength seasonal_strengt… seasonal_peak_y…
#>    <chr>    <chr>    <chr>            <dbl>             <dbl>            <dbl>
#>  1 Adelaide South A… Busine…          0.451             0.380                3
#>  2 Adelaide South A… Holiday          0.541             0.601                1
#>  3 Adelaide South A… Other            0.743             0.189                2
#>  4 Adelaide South A… Visiti…          0.433             0.446                1
#>  5 Adelaid… South A… Busine…          0.453             0.140                3
#>  6 Adelaid… South A… Holiday          0.512             0.244                2
#>  7 Adelaid… South A… Other            0.584             0.374                2
#>  8 Adelaid… South A… Visiti…          0.481             0.228                0
#>  9 Alice S… Norther… Busine…          0.526             0.224                0
#> 10 Alice S… Norther… Holiday          0.377             0.827                3
#> # … with 294 more rows, and 6 more variables: seasonal_trough_year <dbl>,
#> #   spikiness <dbl>, linearity <dbl>, curvature <dbl>, stl_e_acf1 <dbl>,
#> #   stl_e_acf10 <dbl>
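
Because the result is an ordinary tibble, finding the series with the most trend is a matter of sorting. A small sketch, assuming the fpp3 package is installed and attaches dplyr alongside the tourism data (the object names tourism_features and top_trend are only illustrative):

```r
library(fpp3)  # assumed to attach tsibble, feasts, fabletools and dplyr

tourism_features <- tourism %>%
  features(Trips, feat_stl)

# Five series with the strongest trend, largest first.
top_trend <- tourism_features %>%
  arrange(desc(trend_strength)) %>%
  slice_head(n = 5) %>%
  select(Region, State, Purpose, trend_strength)

top_trend
```

Sorting on seasonal_strength_year instead would rank the series by seasonality.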

We can then use these features in plots to identify which types of series are heavily trended and which are most seasonal.

tourism %>%
  features(Trips, feat_stl) %>%
  ggplot(aes(x = trend_strength, y = seasonal_strength_year,
             col = Purpose)) +
  geom_point() +
  facet_wrap(vars(State))

Figure 4.1: Seasonal strength vs trend strength for all tourism series.

Clearly, holiday series are the most seasonal, which is unsurprising. The strongest trends tend to be in Western Australia and Victoria. The most seasonal series can also be easily identified and plotted.

tourism %>%
  features(Trips, feat_stl) %>%
  filter(
    seasonal_strength_year == max(seasonal_strength_year)
  ) %>%
  left_join(tourism, by = c("State", "Region", "Purpose")) %>%
  ggplot(aes(x = Quarter, y = Trips)) +
  geom_line() +
  facet_grid(vars(State, Region, Purpose))

Figure 4.2: The most seasonal series in the Australian tourism data.

This shows holiday trips to the most popular ski region of Australia.

The feat_stl() function returns several features beyond those discussed above.

  • seasonal_peak_year indicates the timing of the peaks — which month or quarter contains the largest seasonal component. This tells us something about the nature of the seasonality. In the Australian tourism data, if Quarter 3 is the peak seasonal period, then people are travelling to the region in winter, whereas a peak in Quarter 1 suggests that the region is more popular in summer.
  • seasonal_trough_year indicates the timing of the troughs — which month or quarter contains the smallest seasonal component.
  • spikiness measures the prevalence of spikes in the remainder component \(R_t\) of the STL decomposition. It is the variance of the leave-one-out variances of \(R_t\).
  • linearity measures the linearity of the trend component of the STL decomposition. It is based on the coefficient of a linear regression applied to the trend component.
  • curvature measures the curvature of the trend component of the STL decomposition. It is based on the coefficient from an orthogonal quadratic regression applied to the trend component.
  • stl_e_acf1 is the first autocorrelation coefficient of the remainder series.
  • stl_e_acf10 is the sum of squares of the first ten autocorrelation coefficients of the remainder series.
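
To make these definitions concrete, the sketch below computes rough analogues of each feature by hand. It assumes base R's stl() and the built-in nottem series for illustration, and it follows the verbal descriptions above rather than the feat_stl() source, so indexing conventions and numerical values may differ from what feat_stl() returns. In particular, taking linearity and curvature as the first- and second-degree coefficients of one orthogonal quadratic fit is only one plausible reading.

```r
# Base R only. Assumption: periodic seasonal window, for illustration.
fit <- stl(nottem, s.window = "periodic")
comp <- fit$time.series
seasonal  <- comp[, "seasonal"]            # kept as ts so cycle() works
trend     <- as.numeric(comp[, "trend"])
remainder <- as.numeric(comp[, "remainder"])

# Timing of peak and trough: position within the year (1 = January here).
peak   <- cycle(seasonal)[which.max(seasonal)]
trough <- cycle(seasonal)[which.min(seasonal)]

# Spikiness: variance of the leave-one-out variances of the remainder.
loo_var <- sapply(seq_along(remainder), function(i) var(remainder[-i]))
spikiness <- var(loo_var)

# Linearity and curvature: coefficients of an orthogonal quadratic
# regression of the trend component on time.
time_index <- seq_along(trend)
trend_fit <- lm(trend ~ poly(time_index, degree = 2))
linearity <- unname(coef(trend_fit)[2])
curvature <- unname(coef(trend_fit)[3])

# Autocorrelations of the remainder at lags 1 to 10 (lag 0 dropped).
r_acf <- acf(remainder, lag.max = 10, plot = FALSE)$acf[-1]
stl_e_acf1  <- r_acf[1]
stl_e_acf10 <- sum(r_acf^2)
```

The leave-one-out loop is written for clarity rather than speed; for long series a closed-form update of the variance would be cheaper.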