4.1 Some simple statistics

Any numerical summary computed from a time series is a feature of that time series — the mean, minimum or maximum, for example.

The features can be computed using the features() function. For example, let’s compute the means of all the series in the Australian tourism data.

tourism %>% features(Trips, mean)
#> New names:
#> * `` -> ...1
#> # A tibble: 304 x 4
#>    Region         State              Purpose    ...1
#>    <chr>          <chr>              <chr>     <dbl>
#>  1 Adelaide       South Australia    Business 156.  
#>  2 Adelaide       South Australia    Holiday  157.  
#>  3 Adelaide       South Australia    Other     56.6 
#>  4 Adelaide       South Australia    Visiting 205.  
#>  5 Adelaide Hills South Australia    Business   2.66
#>  6 Adelaide Hills South Australia    Holiday   10.5 
#>  7 Adelaide Hills South Australia    Other      1.40
#>  8 Adelaide Hills South Australia    Visiting  14.2 
#>  9 Alice Springs  Northern Territory Business  14.6 
#> 10 Alice Springs  Northern Territory Holiday   31.9 
#> # … with 294 more rows

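The new column ...1 contains the mean of each time series. It has this placeholder name because mean() returns an unnamed value, which is also why the "New names" message appears.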
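Conceptually, computing a feature means applying a summary function to each series separately. The following is a minimal base R sketch of that idea, not how features() is implemented; the data frame `toy` and its two series are hypothetical and are not part of the tourism data.

```r
# Hypothetical toy data: two short series stored in long format
toy <- data.frame(
  series = rep(c("A", "B"), each = 4),
  value  = c(1, 2, 3, 4, 10, 20, 30, 40)
)

# Apply mean() to the values of each series separately,
# giving one feature value per series
series_means <- tapply(toy$value, toy$series, mean)
series_means
```

Each series contributes one number, just as each row of the features() output above corresponds to one tourism series.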

It is useful to give the resulting feature columns names to help us remember where they came from. This can be done by passing a named list of functions, where each name becomes the name of the corresponding column.

tourism %>% features(Trips, list(mean=mean)) %>% arrange(mean)
#> # A tibble: 304 x 4
#>    Region          State              Purpose   mean
#>    <chr>           <chr>              <chr>    <dbl>
#>  1 Kangaroo Island South Australia    Other    0.340
#>  2 MacDonnell      Northern Territory Other    0.449
#>  3 Wilderness West Tasmania           Other    0.478
#>  4 Barkly          Northern Territory Other    0.632
#>  5 Clare Valley    South Australia    Other    0.898
#>  6 Barossa         South Australia    Other    1.02 
#>  7 Kakadu Arnhem   Northern Territory Other    1.04 
#>  8 Lasseter        Northern Territory Other    1.14 
#>  9 Wimmera         Victoria           Other    1.15 
#> 10 MacDonnell      Northern Territory Visiting 1.18 
#> # … with 294 more rows

Here we see that the series with the lowest average number of visits was “Other” visits to Kangaroo Island in South Australia.
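A named list can hold more than one function, so several named features can be requested in a single call. The sketch below assumes the fpp3 package is installed and that loading it makes the tourism data and the features() function available; the choice of sd() as a second feature and the object name `named_feats` are for illustration only.

```r
# Assumption: fpp3 attaches tsibble (which provides tourism),
# fabletools (features) and dplyr (arrange)
library(fpp3)

# One column per named function: "mean" and "sd"
named_feats <- tourism %>%
  features(Trips, list(mean = mean, sd = sd)) %>%
  arrange(mean)

named_feats
```

The result should have one row per series, with the key columns followed by a column for each name in the list.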

Rather than compute one feature at a time, it is convenient to compute many features at once. A common short summary of a data set consists of five statistics: the minimum, first quartile, median, third quartile and maximum. These divide the data into four equal-size sections, each containing 25% of the data. The quantile() function can be used to compute them.

tourism %>% features(Trips, quantile, probs=seq(0,1,by=0.25))
#> # A tibble: 304 x 8
#>    Region        State             Purpose    `0%`  `25%`   `50%`  `75%` `100%`
#>    <chr>         <chr>             <chr>     <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
#>  1 Adelaide      South Australia   Busine…  68.7   134.   153.    177.   242.  
#>  2 Adelaide      South Australia   Holiday 108.    135.   154.    172.   224.  
#>  3 Adelaide      South Australia   Other    25.9    43.9   53.8    62.5  107.  
#>  4 Adelaide      South Australia   Visiti… 137.    179.   206.    229.   270.  
#>  5 Adelaide Hil… South Australia   Busine…   0       0      1.26    3.92  28.6 
#>  6 Adelaide Hil… South Australia   Holiday   0       5.77   8.52   14.1   35.8 
#>  7 Adelaide Hil… South Australia   Other     0       0      0.908   2.09   8.95
#>  8 Adelaide Hil… South Australia   Visiti…   0.778   8.91  12.2    16.8   81.1 
#>  9 Alice Springs Northern Territo… Busine…   1.01    9.13  13.3    18.5   34.1 
#> 10 Alice Springs Northern Territo… Holiday   2.81   16.9   31.5    44.8   76.5 
#> # … with 294 more rows

Here the minimum is labelled 0% and the maximum is labelled 100%.
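To see where these labels come from, it may help to call quantile() directly on a small vector. This is a minimal base R sketch; the vector `x` is made up for illustration. The probabilities given here are also the defaults of quantile(), so the probs argument could be omitted.

```r
# Hypothetical data: five evenly spaced values
x <- c(1, 2, 3, 4, 5)

# Minimum, quartiles and maximum, returned as a named vector
five_num <- quantile(x, probs = seq(0, 1, by = 0.25))
five_num

# The names are the percentage labels used as column names above
names(five_num)
#> [1] "0%"   "25%"  "50%"  "75%"  "100%"
```

Because quantile() returns a named vector, features() can use those names directly as column names, so no named list is needed in this case.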