## 4.5 Exploring Australian tourism data

All of the features included in the feasts package can be computed in one line like this.

tourism_features <- tourism %>%
features(Trips, feature_set(pkgs="feasts"))
#> Warning: 304 errors (1 unique) encountered for feature 12
#>   call: fun(libname, pkgname)
#>   error: function 'exp_vec_restore' not provided by package 'vctrs'
#> Warning: 304 errors (1 unique) encountered for feature 13
#>   call: fun(libname, pkgname)
#>   error: function 'exp_vec_restore' not provided by package 'vctrs'
#> Warning: 304 errors (1 unique) encountered for feature 14
#>   call: fun(libname, pkgname)
#>   error: function 'exp_vec_restore' not provided by package 'vctrs'
#> Warning: 304 errors (1 unique) encountered for feature 15
#>   call: fun(libname, pkgname)
#>   error: function 'exp_vec_restore' not provided by package 'vctrs'
#> Warning: 304 errors (1 unique) encountered for feature 16
#>   call: fun(libname, pkgname)
#>   error: function 'exp_vec_restore' not provided by package 'vctrs'
#> Warning: 304 errors (1 unique) encountered for feature 17
#> [304] The ForeCA package must be installed to use this functionality. It can be installed with install.packages("ForeCA")
tourism_features
#> # A tibble: 304 x 42
#>    Region State Purpose trend_strength seasonal_streng… seasonal_peak_y…
#>    <chr>  <chr> <chr>            <dbl>            <dbl>            <dbl>
#>  1 Adela… Sout… Busine…          0.451            0.380                3
#>  2 Adela… Sout… Holiday          0.541            0.601                1
#>  3 Adela… Sout… Other            0.743            0.189                2
#>  4 Adela… Sout… Visiti…          0.433            0.446                1
#>  5 Adela… Sout… Busine…          0.453            0.140                3
#>  6 Adela… Sout… Holiday          0.512            0.244                2
#>  7 Adela… Sout… Other            0.584            0.374                2
#>  8 Adela… Sout… Visiti…          0.481            0.228                0
#>  9 Alice… Nort… Busine…          0.526            0.224                0
#> 10 Alice… Nort… Holiday          0.377            0.827                3
#> # … with 294 more rows, and 36 more variables: seasonal_trough_year <dbl>,
#> #   spikiness <dbl>, linearity <dbl>, curvature <dbl>, stl_e_acf1 <dbl>,
#> #   stl_e_acf10 <dbl>, acf1 <dbl>, acf10 <dbl>, diff1_acf1 <dbl>,
#> #   diff1_acf10 <dbl>, diff2_acf1 <dbl>, diff2_acf10 <dbl>, season_acf1 <dbl>,
#> #   pacf5 <dbl>, diff1_pacf5 <dbl>, diff2_pacf5 <dbl>, season_pacf <dbl>,
#> #   zero_run_mean <dbl>, nonzero_squared_cv <dbl>, zero_start_prop <dbl>,
#> #   zero_end_prop <dbl>, lambda_guerrero <dbl>, kpss_stat <dbl>,
#> #   kpss_pvalue <dbl>, pp_stat <dbl>, pp_pvalue <dbl>, ndiffs <int>,
#> #   nsdiffs <int>, bp_stat <dbl>, bp_pvalue <dbl>, lb_stat <dbl>,
#> #   lb_pvalue <dbl>, n_crossing_points <int>, n_flat_spots <int>,
#> #   coef_hurst <dbl>, stat_arch_lm <dbl>

This gives 39 features for every combination of the three key variables (Region, State and Purpose).

We can treat this tibble like any data set and analyse it find interesting observations or groups of observations.

We’ve already seen how we can plot one feature against another (Section 4.3). We can also do scatterplot matrices of groups of features. For example, let’s look at all the features involving seasonality in Figure 4.1. We have also included the Purpose variable.

tourism_features %>%
select_at(vars(contains("season"), Purpose)) %>%
mutate(
seasonal_peak_year = glue::glue("Q{seasonal_peak_year+1}"),
seasonal_trough_year = glue::glue("Q{seasonal_trough_year+1}"),
) %>%
GGally::ggpairs(mapping = aes(colour=Purpose))

Here, the Purpose variable is mapped to colour. There is a lot of information in this figure, and we will highlight just a few things we can learn.

• The three numerical measure related to seasonality (seasonal_strength_year, season_acf1 and season_pacf) are all positively correlated.
• The bottom left panel and the top right panel both show that the most strongly seasonal series are related to holidays (as we saw previously).
• The bar plots in the bottom row of the seasonal_peak_year and seasonal_trough_year columns show that seasonal peaks in Business travel occurs most often in Quarter 4, and least often in Quarter 2.

It is difficult to explore more than a handful of variables in this way. A useful way to handle many more variables is to use a dimension reduction technique such as principal components. This gives linear combinations of variables that explain the most variation in the original data. We can compute the principal components of the tourism features as follows.

library(broom)
pcs <- tourism_features %>%
select(-State, -Region, -Purpose) %>%
prcomp(scale=TRUE) %>%
augment(tourism_features)
pcs %>%
ggplot(aes(x=.fittedPC1, y=.fittedPC2, col=Purpose)) +
geom_point() + theme(aspect.ratio=1)

Each point on Figure 4.2 represents one series and its location on the plot is based on all 39 features. The first principal component (.fittedPC1) is the linear combination of the features which explains the most variation in the data. The second principal component (.fittedPC2) is the linear combination which explains the next most variation in the data, while being uncorrelated with the first principal component. For more information about principal component dimension reduction, see Izenman (2008).

Figure 4.2 reveals a few things about the tourism data. First, the holiday series behave quite differently from the rest of the series. Almost all of the holiday series appear in the top half of the plot, while almost all of the remaining series appear in the bottom half of the plot. Clearly, the second principal component is distinguishing between holidays and other types of travel.

The plot also allows us to identify anomalous time series — series which have unusual feature combinations. These appear as points that are separate from the majority of series in Figure 4.2. There are three which stand out, and we can identify which series they correspond to as follows.

outliers <- pcs %>%
filter(.fittedPC1 > 10.5) %>%
select(Region, State, Purpose, .fittedPC1, .fittedPC2)
outliers
#> # A tibble: 2 x 5
#>   Region                 State             Purpose  .fittedPC1 .fittedPC2
#>   <chr>                  <chr>             <chr>         <dbl>      <dbl>
#> 1 Australia's North West Western Australia Business       14.5      -8.43
#> 2 Melbourne              Victoria          Holiday        12.9      -8.46
outliers %>%
left_join(tourism, by = c("State", "Region", "Purpose")) %>%
mutate(Series = glue::glue("{State}", "{Region}", "{Purpose}", .sep = "\n\n")) %>%
ggplot(aes(x = Quarter, y = Trips)) +
geom_line() +
facet_grid(Series ~ ., scales='free') +
ggtitle("Outlying time series in PC space")

We can speculate why these series are identified as unusual.

• Melbourne is an unusual holiday destination because it has almost no seasonality, whereas most holiday destinations in Australia have highly seasonal tourism.
• The north western corner of Western Australia is unusual because it is has an increase in business tourism in the last few years of data, but little or no seasonality.
• The south western corner of Western Australia is unusual because it is has both an increase in holiday tourism in the last few years of data and a high level of seasonality.

### Bibliography

Izenman, A. J. (2008). Modern multivariate statistical techniques: Regression, classification and manifold learning. New York: Springer.