4.5 Exploring Australian tourism data

All of the features included in the feasts package can be computed in one line like this.

tourism_features <- tourism %>%
  features(Trips, feature_set(pkgs = "feasts"))
#> # A tibble: 304 x 51
#>    Region State Purpose trend_strength seasonal_streng… seasonal_peak_y…
#>    <chr>  <chr> <chr>            <dbl>            <dbl>            <dbl>
#>  1 Adela… Sout… Busine…          0.451            0.380                3
#>  2 Adela… Sout… Holiday          0.541            0.601                1
#>  3 Adela… Sout… Other            0.743            0.189                2
#>  4 Adela… Sout… Visiti…          0.433            0.446                1
#>  5 Adela… Sout… Busine…          0.453            0.140                3
#>  6 Adela… Sout… Holiday          0.512            0.244                2
#>  7 Adela… Sout… Other            0.584            0.374                2
#>  8 Adela… Sout… Visiti…          0.481            0.228                0
#>  9 Alice… Nort… Busine…          0.526            0.224                0
#> 10 Alice… Nort… Holiday          0.377            0.827                3
#> # … with 294 more rows, and 45 more variables: seasonal_trough_year <dbl>,
#> #   spikiness <dbl>, linearity <dbl>, curvature <dbl>, stl_e_acf1 <dbl>,
#> #   stl_e_acf10 <dbl>, acf1 <dbl>, acf10 <dbl>, diff1_acf1 <dbl>,
#> #   diff1_acf10 <dbl>, diff2_acf1 <dbl>, diff2_acf10 <dbl>, season_acf1 <dbl>,
#> #   pacf5 <dbl>, diff1_pacf5 <dbl>, diff2_pacf5 <dbl>, season_pacf <dbl>,
#> #   zero_run_mean <dbl>, nonzero_squared_cv <dbl>, zero_start_prop <dbl>,
#> #   zero_end_prop <dbl>, lambda_guerrero <dbl>, kpss_stat <dbl>,
#> #   kpss_pvalue <dbl>, pp_stat <dbl>, pp_pvalue <dbl>, ndiffs <int>,
#> #   nsdiffs <int>, bp_stat <dbl>, bp_pvalue <dbl>, lb_stat <dbl>,
#> #   lb_pvalue <dbl>, var_tiled_var <dbl>, var_tiled_mean <dbl>,
#> #   shift_level_max <dbl>, shift_level_index <dbl>, shift_var_max <dbl>,
#> #   shift_var_index <dbl>, shift_kl_max <dbl>, shift_kl_index <dbl>,
#> #   spectral_entropy <dbl>, n_crossing_points <int>, longest_flat_spot <int>,
#> #   coef_hurst <dbl>, stat_arch_lm <dbl>

This gives 48 features for every combination of the three key variables (Region, State and Purpose).

We can treat this tibble like any data set and analyse it to find interesting observations or groups of observations.

We’ve already seen how we can plot one feature against another (Section 4.3). We can also do scatterplot matrices of groups of features. For example, let’s look at all the features involving seasonality in Figure 4.1. We have also included the Purpose variable.

tourism_features %>%
  select_at(vars(contains("season"), Purpose)) %>%
    seasonal_peak_year = glue::glue("Q{seasonal_peak_year+1}"),
    seasonal_trough_year = glue::glue("Q{seasonal_trough_year+1}"),
  ) %>%
  GGally::ggpairs(mapping = aes(colour = Purpose))
A scatterplot matrix of all the seasonal features for the Australian tourism data

Figure 4.1: A scatterplot matrix of all the seasonal features for the Australian tourism data

Here, the Purpose variable is mapped to colour. There is a lot of information in this figure, and we will highlight just a few things we can learn.

  • The three numerical measure related to seasonality (seasonal_strength_year, season_acf1 and season_pacf) are all positively correlated.
  • The bottom left panel and the top right panel both show that the most strongly seasonal series are related to holidays (as we saw previously).
  • The bar plots in the bottom row of the seasonal_peak_year and seasonal_trough_year columns show that seasonal peaks in Business travel occurs most often in Quarter 4, and least often in Quarter 2.

It is difficult to explore more than a handful of variables in this way. A useful way to handle many more variables is to use a dimension reduction technique such as principal components. This gives linear combinations of variables that explain the most variation in the original data. We can compute the principal components of the tourism features as follows.

pcs <- tourism_features %>%
  select(-State, -Region, -Purpose) %>%
  prcomp(scale = TRUE) %>%
pcs %>%
  ggplot(aes(x = .fittedPC1, y = .fittedPC2, col = Purpose)) +
  geom_point() +
  theme(aspect.ratio = 1)
A plot of the first two principal components, calculated from the 44 features of the Australian quarterly tourism data.

Figure 4.2: A plot of the first two principal components, calculated from the 44 features of the Australian quarterly tourism data.

Each point on Figure 4.2 represents one series and its location on the plot is based on all 48 features. The first principal component (.fittedPC1) is the linear combination of the features which explains the most variation in the data. The second principal component (.fittedPC2) is the linear combination which explains the next most variation in the data, while being uncorrelated with the first principal component. For more information about principal component dimension reduction, see Izenman (2008).

Figure 4.2 reveals a few things about the tourism data. First, the holiday series behave quite differently from the rest of the series. Almost all of the holiday series appear in the top half of the plot, while almost all of the remaining series appear in the bottom half of the plot. Clearly, the second principal component is distinguishing between holidays and other types of travel.

The plot also allows us to identify anomalous time series — series which have unusual feature combinations. These appear as points that are separate from the majority of series in Figure 4.2. There are four that stand out, and we can identify which series they correspond to as follows.

outliers <- pcs %>%
  filter(.fittedPC1 > 10) %>%
  select(Region, State, Purpose, .fittedPC1, .fittedPC2)
#> # A tibble: 4 x 5
#>   Region                 State             Purpose  .fittedPC1 .fittedPC2
#>   <chr>                  <chr>             <chr>         <dbl>      <dbl>
#> 1 Australia's North West Western Australia Business       13.4    -11.4  
#> 2 Australia's South West Western Australia Holiday        10.9      0.857
#> 3 Melbourne              Victoria          Holiday        12.3    -10.5  
#> 4 South Coast            New South Wales   Holiday        11.9      9.40
outliers %>%
  left_join(tourism, by = c("State", "Region", "Purpose")) %>%
  mutate(Series = glue::glue("{State}", "{Region}", "{Purpose}", .sep = "\n\n")) %>%
  ggplot(aes(x = Quarter, y = Trips)) +
  geom_line() +
  facet_grid(Series ~ ., scales = "free") +
  labs(title = "Outlying time series in PC space")

We can speculate why these series are identified as unusual.

  • Holiday visits to the south coast of NSW is highly seasonal but has almost no trend, whereas most holiday destinations in Australia show some trend over time.
  • Melbourne is an unusual holiday destination because it has almost no seasonality, whereas most holiday destinations in Australia have highly seasonal tourism.
  • The north western corner of Western Australia is unusual because it shows an increase in business tourism in the last few years of data, but little or no seasonality.
  • The south western corner of Western Australia is unusual because it shows both an increase in holiday tourism in the last few years of data and a high level of seasonality.