2.1 tsibble objects

A time series can be thought of as a list of numbers (the measurements), along with some information about what times those numbers were recorded (the index). This information can be stored as a tsibble object in R.

The index variable

Suppose you have annual observations for the last few years:

Year Observation
2015 123
2016 39
2017 78
2018 52
2019 110

We turn this into a tsibble object using the tsibble() function:
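A minimal sketch of this step, assuming the tsibble package is installed. The object name y is an arbitrary choice for illustration.

```r
library(tsibble)

# Build the tsibble directly from the values in the table above.
# `index = Year` tells tsibble which column holds the time information.
y <- tsibble(
  Year = 2015:2019,
  Observation = c(123, 39, 78, 52, 110),
  index = Year
)
y
```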

tsibble objects extend tidy data frames (tibble objects) by introducing temporal structure. We have set the time series index to be the Year column, which associates the measurements (Observation) with the time of recording (Year).

For observations that are more frequent than once per year, we need to use a time class function on the index. For example, suppose we have a monthly dataset z:

This can be converted to a tsibble object using the yearmonth() function:
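The following sketch shows one way this conversion might look. The values in z are hypothetical and serve only to make the example self-contained; the name z_ts is likewise an assumption.

```r
library(tsibble)
library(dplyr)

# Hypothetical monthly data: Month is stored as text such as "2019 Jan".
z <- tibble(
  Month = c("2019 Jan", "2019 Feb", "2019 Mar", "2019 Apr", "2019 May"),
  Observation = c(50, 23, 34, 30, 25)
)

# Convert the text column to a monthly time object, then declare it
# as the index of the tsibble.
z_ts <- z |>
  mutate(Month = yearmonth(Month)) |>
  as_tsibble(index = Month)
z_ts
```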

First, the Month column is converted from text to a monthly time object with yearmonth(). As we are already working with a data frame, we then add the temporal information and convert the data frame to a tsibble using as_tsibble(). Note the addition of “[1M]” on the first line of the printed output, indicating this is monthly data.

Other time class functions can be used depending on the frequency of the observations.

Frequency Function
Annual start:end
Quarterly yearquarter()
Monthly yearmonth()
Weekly yearweek()
Daily as_date(), ymd()
Sub-daily as_datetime()
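As a brief illustration of the functions in the table, the sketch below applies each one to a hypothetical text value. It assumes the tsibble and lubridate packages are installed; the input strings are made up for the example.

```r
library(tsibble)
library(lubridate)

# Each function converts text into the matching time class.
q  <- yearquarter("2019 Q3")               # quarterly
m  <- yearmonth("2019 Mar")                # monthly
w  <- yearweek("2019 W10")                 # weekly
d  <- as_date("2019-03-05")                # daily
dt <- as_datetime("2019-03-05 14:30:00")   # sub-daily (UTC by default)

# For annual data a plain integer sequence is enough.
yrs <- 2015:2019
```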

The key variables

A tsibble also allows multiple time series to be stored in a single object. Suppose you are interested in a dataset containing the fastest running times for women’s and men’s track races at the Olympics, from 100m to 10000m:
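This dataset is available as olympic_running in the tsibbledata package. Assuming that package is installed, printing the object gives the summary discussed next.

```r
library(tsibble)
library(tsibbledata)

# Printing a tsibble shows its dimensions, interval, key structure
# and the first 10 rows.
olympic_running
```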

The summary above shows that this is a tsibble object, which contains 312 rows and 4 columns. Alongside this, “[4Y]” informs us that the interval of these observations is every four years. Below this is the key structure, which informs us that there are 14 separate time series in the tsibble. A preview of the first 10 observations is also shown, in which we can see a missing value occurs in 1916. This is because the Olympics were not held during World War I.

The 14 time series in this object are uniquely identified by the keys: the Length and Sex variables.
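The key structure can also be inspected programmatically. The sketch below uses key_vars() and n_keys() from tsibble, and distinct() from dplyr to list the categories of one key variable.

```r
library(tsibble)
library(tsibbledata)
library(dplyr)

keys <- key_vars(olympic_running)      # names of the key variables
n_series <- n_keys(olympic_running)    # number of separate time series

# List the categories of the Sex key.
sexes <- olympic_running |> distinct(Sex)
sexes
```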

Working with tsibble objects

We can use dplyr functions such as mutate(), filter(), select() and summarise() to work with tsibble objects. To illustrate these, we will use the PBS tsibble containing sales data on pharmaceutical products in Australia.

This contains monthly Medicare Australia prescription data from July 1991 to June 2008. The data are classified according to various concession types and Anatomical Therapeutic Chemical (ATC) indexes. For this example, we are interested in the Cost time series (total cost of scripts in Australian dollars).

We can use the filter() function to extract the A10 scripts:
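A sketch of this step, assuming the tsibbledata package (which supplies PBS) is installed. The ATC2 column holds the second-level ATC classification; the name a10_rows is an assumption made for this example.

```r
library(tsibble)
library(tsibbledata)
library(dplyr)

# Keep only the rows whose ATC2 code is "A10".
a10_rows <- PBS |>
  filter(ATC2 == "A10")
a10_rows
```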

This allows rows of the tsibble to be selected. Next we can simplify the resulting object by selecting the columns we will need in subsequent analysis.
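Continuing the sketch, the pipeline can be extended with select(). The name a10_cols is an assumption made for this example.

```r
library(tsibble)
library(tsibbledata)
library(dplyr)

# Filter the A10 rows, then keep the index (Month), the two keys that
# remain relevant (Concession, Type) and the measurement (Cost).
a10_cols <- PBS |>
  filter(ATC2 == "A10") |>
  select(Month, Concession, Type, Cost)
a10_cols
```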

The select() function allows us to select particular columns, while filter() allows us to select particular rows.

Note that the index variable Month would be returned even if it were not explicitly selected, as it is required for a tsibble. The key variables Concession and Type are needed as well: without them there would be duplicate rows for each month, and the result would not be a valid tsibble.

Another useful function is summarise(), which allows us to combine data across keys. For example, we may wish to compute the total cost per month regardless of the Concession or Type keys.
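A sketch of the aggregation, again assuming tsibbledata is installed. For a tsibble, summarise() works within each time point of the index, so no explicit grouping by Month is needed. The name a10_total is an assumption made for this example.

```r
library(tsibble)
library(tsibbledata)
library(dplyr)

# Sum Cost over all Concession and Type combinations for each month.
a10_total <- PBS |>
  filter(ATC2 == "A10") |>
  select(Month, Concession, Type, Cost) |>
  summarise(TotalC = sum(Cost))
a10_total
```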

The new variable TotalC is the sum of all Cost values for each month.

We can create new variables using the mutate() function. Here we change the units from dollars to millions of dollars:
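The unit change can be sketched by adding a mutate() step to the same pipeline. The name a10_millions is an assumption made for this example.

```r
library(tsibble)
library(tsibbledata)
library(dplyr)

# Divide the monthly totals by one million to express Cost in $ millions.
a10_millions <- PBS |>
  filter(ATC2 == "A10") |>
  select(Month, Concession, Type, Cost) |>
  summarise(TotalC = sum(Cost)) |>
  mutate(Cost = TotalC / 1e6)
a10_millions
```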

Finally, we will save the resulting tsibble for examples later in this chapter.
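The complete pipeline, with the result stored under the name a10 (a name assumed here for illustration), might look as follows.

```r
library(tsibble)
library(tsibbledata)
library(dplyr)

# The right assignment on the last line stores the result as `a10`.
PBS |>
  filter(ATC2 == "A10") |>
  select(Month, Concession, Type, Cost) |>
  summarise(TotalC = sum(Cost)) |>
  mutate(Cost = TotalC / 1e6) -> a10
```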

At the end of this series of piped functions, we have used a right assignment (->), which is not common in R code, but is convenient at the end of a long series of commands as it continues the flow of the code.

Read a csv file and convert to a tsibble

Almost all of the data used in this book is already stored as tsibble objects. But most data lives in databases, MS-Excel files or csv files, before it is imported into R. So often the first step in creating a tsibble is to read in the data, and then identify the index and key variables.

For example, suppose we have the following quarterly data stored in a csv file (only the first 10 rows are shown). This data set provides information on the size of the prison population in Australia, disaggregated by state, gender, legal status and indigenous status. (Here, ATSI stands for Aboriginal or Torres Strait Islander.)

date state gender legal indigenous count
2005-03-01 ACT Female Remanded ATSI 0
2005-03-01 ACT Female Remanded Non-ATSI 2
2005-03-01 ACT Female Sentenced ATSI 0
2005-03-01 ACT Female Sentenced Non-ATSI 5
2005-03-01 ACT Male Remanded ATSI 7
2005-03-01 ACT Male Remanded Non-ATSI 58
2005-03-01 ACT Male Sentenced ATSI 5
2005-03-01 ACT Male Sentenced Non-ATSI 101
2005-03-01 NSW Female Remanded ATSI 51
2005-03-01 NSW Female Remanded Non-ATSI 131

We can read it into R, and create a tsibble object, by simply identifying which column contains the time index, and which columns are keys. The remaining columns are values — there can be many value columns, although in this case there is only one (count). The original csv file stored the dates as individual days, although the data is actually quarterly, so we need to convert the date variable to quarters.
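The steps just described can be sketched as follows. To keep the example self-contained, it first writes the ten rows shown above to a temporary csv file; in practice, path would point to the real file. The names path, Quarter and prison are assumptions made for this sketch, which also requires the readr package.

```r
library(tsibble)
library(dplyr)
library(readr)

# Write the excerpt shown above to a temporary file (illustration only).
path <- tempfile(fileext = ".csv")
writeLines(c(
  "date,state,gender,legal,indigenous,count",
  "2005-03-01,ACT,Female,Remanded,ATSI,0",
  "2005-03-01,ACT,Female,Remanded,Non-ATSI,2",
  "2005-03-01,ACT,Female,Sentenced,ATSI,0",
  "2005-03-01,ACT,Female,Sentenced,Non-ATSI,5",
  "2005-03-01,ACT,Male,Remanded,ATSI,7",
  "2005-03-01,ACT,Male,Remanded,Non-ATSI,58",
  "2005-03-01,ACT,Male,Sentenced,ATSI,5",
  "2005-03-01,ACT,Male,Sentenced,Non-ATSI,101",
  "2005-03-01,NSW,Female,Remanded,ATSI,51",
  "2005-03-01,NSW,Female,Remanded,Non-ATSI,131"
), path)

# Read the file, convert the daily dates to quarters, and declare
# the index and key variables.
prison <- read_csv(path) |>
  mutate(Quarter = yearquarter(date)) |>
  select(-date) |>
  as_tsibble(
    key = c(state, gender, legal, indigenous),
    index = Quarter
  )
prison
```

Because the excerpt covers a single date, this sketch produces only one observation for each of the ten key combinations it contains; the description that follows refers to the full file.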

This tsibble contains 64 separate time series corresponding to the combinations of the 8 states, 2 genders, 2 legal statuses and 2 indigenous statuses. Each of these series is 48 observations in length, from 2005 Q1 to 2016 Q4.

For a tsibble to be valid, it requires a unique index for each combination of keys. The tsibble() or as_tsibble() function will return an error if this is not true.
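To illustrate this requirement, the sketch below builds a small hypothetical data frame in which one series has two rows for the same year. The data and the names dup, result and dup_rows are made up for the example; duplicates() and is_duplicated() are helper functions from tsibble for locating such rows.

```r
library(tsibble)
library(dplyr)

# Hypothetical data: series "a" has two rows for 2016.
dup <- tibble(
  Year = c(2015L, 2016L, 2016L, 2015L, 2016L),
  Series = c("a", "a", "a", "b", "b"),
  Value = c(1, 2, 3, 4, 5)
)

# as_tsibble() stops with an error rather than build an invalid tsibble.
result <- tryCatch(
  as_tsibble(dup, key = Series, index = Year),
  error = function(e) "error"
)

# Check for duplicated key-index pairs and return the offending rows.
has_dups <- is_duplicated(dup, key = Series, index = Year)
dup_rows <- duplicates(dup, key = Series, index = Year)
dup_rows
```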

The seasonal period

Some graphics and some models will use the seasonal period of the data. The seasonal period is the number of observations before the seasonal pattern repeats. In most cases, this will be automatically detected using the time index variable.

Some common periods for different time intervals are shown in the table below:

Data      Minute  Hour  Day    Week    Year
Quarters  -       -     -      -       4
Months    -       -     -      -       12
Weeks     -       -     -      -       52
Days      -       -     -      7       365.25
Hours     -       -     24     168     8766
Minutes   -       60    1440   10080   525960
Seconds   60      3600  86400  604800  31557600

For quarterly, monthly and weekly data, there is only one seasonal period — the number of observations within each year. Actually, there are not \(52\) weeks in a year, but \(365.25/7 = 52.18\) on average, allowing for a leap year every fourth year. Approximating seasonal periods to integers can be useful as many seasonal terms in models only support integer seasonal periods.

If the data is observed more than once per week, then there is often more than one seasonal pattern in the data. For example, data with daily observations might have weekly (period\(=7\)) or annual (period\(=365.25\)) seasonal patterns. Similarly, data that are observed every minute might have hourly (period\(=60\)), daily (period\(=24\times60=1440\)), weekly (period\(=24\times60\times7=10080\)) and annual seasonality (period\(=24\times60\times365.25=525960\)).
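The arithmetic behind these periods can be checked directly in R. The variable names below are chosen for this sketch only.

```r
# Average length of a year in days, allowing for leap years.
days_per_year <- 365.25

# Weekly data: average number of weeks per year (about 52.18).
weeks_per_year <- days_per_year / 7

# Minute-by-minute data: observations per hour, day, week and year.
minutes_per_hour <- 60
minutes_per_day  <- 24 * minutes_per_hour
minutes_per_week <- 7 * minutes_per_day
minutes_per_year <- days_per_year * minutes_per_day

round(weeks_per_year, 2)
c(minutes_per_hour, minutes_per_day, minutes_per_week, minutes_per_year)
```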

More complicated (and unusual) seasonal patterns can be specified using the period() function in the lubridate package.