4.1 Some simple statistics
Any numerical summary computed from a time series is a feature of that time series — the mean, minimum or maximum, for example.
Features can be computed using the features() function. For example, let’s compute the means of all the series in the Australian tourism data.
tourism %>% features(Trips, mean)
#> New names:
#> * `` -> ...1
#> # A tibble: 304 x 4
#> Region State Purpose ...1
#> <chr> <chr> <chr> <dbl>
#> 1 Adelaide South Australia Business 156.
#> 2 Adelaide South Australia Holiday 157.
#> 3 Adelaide South Australia Other 56.6
#> 4 Adelaide South Australia Visiting 205.
#> 5 Adelaide Hills South Australia Business 2.66
#> 6 Adelaide Hills South Australia Holiday 10.5
#> 7 Adelaide Hills South Australia Other 1.40
#> 8 Adelaide Hills South Australia Visiting 14.2
#> 9 Alice Springs Northern Territory Business 14.6
#> 10 Alice Springs Northern Territory Holiday 31.9
#> # … with 294 more rows
The new column ...1 contains the mean of each time series.
It is useful to give the resulting feature columns names to help us remember where they came from. This can be done by passing a named list of functions.
tourism %>%
  features(Trips, list(mean = mean)) %>%
  arrange(mean)
#> # A tibble: 304 x 4
#> Region State Purpose mean
#> <chr> <chr> <chr> <dbl>
#> 1 Kangaroo Island South Australia Other 0.340
#> 2 MacDonnell Northern Territory Other 0.449
#> 3 Wilderness West Tasmania Other 0.478
#> 4 Barkly Northern Territory Other 0.632
#> 5 Clare Valley South Australia Other 0.898
#> 6 Barossa South Australia Other 1.02
#> 7 Kakadu Arnhem Northern Territory Other 1.04
#> 8 Lasseter Northern Territory Other 1.14
#> 9 Wimmera Victoria Other 1.15
#> 10 MacDonnell Northern Territory Visiting 1.18
#> # … with 294 more rows
Here we see that the series with the lowest average number of visits was “Other” visits to Kangaroo Island in South Australia.
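Several named functions can be supplied in the same list. The following is a minimal sketch, assuming the fpp3 package is installed and loaded (it provides the tourism data and features()); using sd as a second feature is only an illustration and is not part of the original example.

```r
library(fpp3)

# Compute two named features for every series in one call.
# Each list name (mean, sd) becomes the name of an output column.
tourism_feats <- tourism %>%
  features(Trips, list(mean = mean, sd = sd))

tourism_feats
```

The result should have one row per series, with the key columns followed by one column for each named function.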
Rather than compute one feature at a time, it is convenient to compute many features at once. A common short summary of a data set consists of five summary statistics: the minimum, first quartile, median, third quartile and maximum. These divide the data into four equal-size sections, each containing 25% of the data. The quantile() function can be used to compute them.
tourism %>% features(Trips, quantile)
#> # A tibble: 304 x 8
#> Region State Purpose `0%` `25%` `50%` `75%` `100%`
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Adelaide South Australia Busine… 68.7 134. 153. 177. 242.
#> 2 Adelaide South Australia Holiday 108. 135. 154. 172. 224.
#> 3 Adelaide South Australia Other 25.9 43.9 53.8 62.5 107.
#> 4 Adelaide South Australia Visiti… 137. 179. 206. 229. 270.
#> 5 Adelaide Hil… South Australia Busine… 0 0 1.26 3.92 28.6
#> 6 Adelaide Hil… South Australia Holiday 0 5.77 8.52 14.1 35.8
#> 7 Adelaide Hil… South Australia Other 0 0 0.908 2.09 8.95
#> 8 Adelaide Hil… South Australia Visiti… 0.778 8.91 12.2 16.8 81.1
#> 9 Alice Springs Northern Territo… Busine… 1.01 9.13 13.3 18.5 34.1
#> 10 Alice Springs Northern Territo… Holiday 2.81 16.9 31.5 44.8 76.5
#> # … with 294 more rows
Here the minimum is labelled 0% and the maximum is labelled 100%.
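The percentage labels come from quantile() itself, which returns a named vector. A small base R sketch shows this on a made-up vector (the values in x are hypothetical and are not taken from the tourism data):

```r
# Hypothetical toy data, used only to illustrate the labelling
x <- c(1, 2, 3, 4, 5)

# By default quantile() uses probabilities 0, 0.25, 0.5, 0.75 and 1
q <- quantile(x)

# The names of the result are the percentage labels seen in the columns above
names(q)

# The "0%" and "100%" entries are the minimum and maximum of the data
q[["0%"]] == min(x)
q[["100%"]] == max(x)
```

When quantile is passed to features(), these names are carried through as the names of the feature columns.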