3.1 A tidy forecasting workflow

The process of producing forecasts for time series data can be broken down into a few steps.

To illustrate the process, we will fit linear trend models to national GDP data stored in global_economy.

Data preparation (tidy)

The first step in forecasting is to prepare data in the correct format. This process may involve loading in data, identifying missing values, filtering the time series and other pre-processing tasks. The functionality provided by tsibble and other packages in the tidyverse substantially simplifies this step.

Different models have different data requirements: some require the series to be in time order, while others require that there be no missing values. Checking your data is essential to understanding its features, and should be done before any models are estimated.

The way in which your data is prepared can also be used to explore different features of the time series. As we will see later, pre-processing your dataset is an important step in evaluating model performance using cross-validation.

For our example, the data preparation has already been done in creating the global_economy tsibble.
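
Although no further preparation is needed here, the kinds of checks described above can be sketched as follows. This is a minimal illustration that assumes the fpp3 package is installed (it loads tsibble, dplyr and the global_economy data); the object names sweden and n_missing are introduced only for this example.

```r
library(fpp3)

# Keep a single series so the checks are easy to read
sweden <- global_economy %>%
  filter(Country == "Sweden")

# Is the series stored in time order?
is_ordered(sweden)

# Are there implicit gaps (years missing entirely from the index)?
has_gaps(sweden)

# How many explicit missing values (NA) does GDP contain?
n_missing <- sweden %>%
  as_tibble() %>%
  summarise(n_missing = sum(is.na(GDP)))
n_missing
```

If implicit gaps were found, tsibble's fill_gaps() could be used to turn them into explicit missing values before modelling.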

Plot the data (visualise)

As we have seen in Chapter 2, visualisation is an essential step in understanding the data. Looking at your data allows you to identify common patterns, and subsequently specify an appropriate model.

The data for one country in our example are plotted in Figure 3.1.

library(fpp3)

# GDP in global_economy is recorded in $US, so the axis is labelled accordingly
global_economy %>%
  filter(Country == "Sweden") %>%
  autoplot(GDP) +
    ggtitle("GDP for Sweden") + ylab("$US")

Figure 3.1: GDP data for Sweden from 1960 to 2017.

Define a model (specify)

Before fitting a model to the data, we first must describe the model. There are many different time series models that can be used for forecasting, and much of this book is dedicated to describing various models. Specifying an appropriate model for the data is essential for producing appropriate forecasts.

In the fable package, models are specified using model functions, each of which uses a formula (y ~ x) interface. The response variable(s) are specified on the left of the formula, and the structure of the model is written on the right.

For example, a linear model (discussed in Chapter 6) that models GDP using a linear trend can be specified with

TSLM(GDP ~ trend()).

In this case the model function is TSLM() (time series linear model), the response variable is GDP and it is being modelled using trend() (a “special” function specifying a linear trend). We will take a closer look at how each model can be specified in its respective section.

The special functions used to define the model’s structure vary between models (as each model can support different structures). The “Specials” section of the documentation for each model function lists these special functions and how they can be used.

The left side of the formula also supports many transformations, which can be useful in simplifying the time series patterns or constraining the forecasts to be between specific values. When forecasting from a model with transformations, the appropriate back-transformation will be applied to ensure forecasts are on the correct scale. Some common transformations which can be used when modelling are shown in Section 3.3.
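
As a sketch of how a transformation on the left side of the formula behaves, the example below fits a linear trend to log(GDP) for a single country. The choice of a log transformation and the names sweden, fit_log, log_trend and fc_log are assumptions made for illustration; the model() and forecast() functions used here are described later in this section.

```r
library(fpp3)

sweden <- global_economy %>%
  filter(Country == "Sweden")

# The response is transformed inside the formula, so the trend is fitted to log(GDP)
fit_log <- sweden %>%
  model(log_trend = TSLM(log(GDP) ~ trend()))

# Forecasts are back-transformed automatically, so they are reported
# on the original GDP scale rather than on the log scale
fc_log <- fit_log %>%
  forecast(h = 3)
fc_log
```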

Train the model (estimate)

Once an appropriate model is specified, we next train the model on some data. One or more model specifications can be estimated using the model() function.

To estimate the model in our example, we use

fit <- global_economy %>%
  model(trend_model = TSLM(GDP ~ trend()))

This fits a linear trend model to the GDP data for each combination of key variables in the tsibble. In this example, it will fit a model to each of the 263 countries in the dataset. The resulting object is a model table or a “mable”.

fit
#> # A mable: 263 x 2
#> # Key:     Country [263]
#>    Country             trend_model
#>    <fct>               <model>    
#>  1 Afghanistan         <TSLM>     
#>  2 Albania             <TSLM>     
#>  3 Algeria             <TSLM>     
#>  4 American Samoa      <TSLM>     
#>  5 Andorra             <TSLM>     
#>  6 Angola              <TSLM>     
#>  7 Antigua and Barbuda <TSLM>     
#>  8 Arab World          <TSLM>     
#>  9 Argentina           <TSLM>     
#> 10 Armenia             <TSLM>     
#> # … with 253 more rows

Each row corresponds to one combination of the key variables. The trend_model column contains information about the fitted model for each country. In later chapters we will learn how to see more information about each model.
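
Because model() accepts one or more specifications, several candidate models can be estimated in a single call, each becoming its own column of the mable. The sketch below is illustrative only: the second specification (a linear trend on the log scale), the name fit_two and the restriction to three countries are assumptions chosen to keep the output small.

```r
library(fpp3)

fit_two <- global_economy %>%
  filter(Country %in% c("Sweden", "Australia", "Japan")) %>%
  model(
    trend_model = TSLM(GDP ~ trend()),
    log_trend   = TSLM(log(GDP) ~ trend())
  )

# One row per country, one column per model specification
fit_two
```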

Check model performance (evaluate)

Once a model has been fitted, it is important to check how well it has performed on the data. Several diagnostic tools are available to check model behaviour, along with accuracy measures that allow one model to be compared against another. Chapter 4 goes into further detail.
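
Ahead of the fuller treatment in Chapter 4, a first look at these tools might resemble the following sketch. It assumes the fabletools functions augment() and accuracy(), which return the fitted values with residuals and a set of training-data accuracy measures respectively; the name fit_sweden is hypothetical.

```r
library(fpp3)

fit_sweden <- global_economy %>%
  filter(Country == "Sweden") %>%
  model(trend_model = TSLM(GDP ~ trend()))

# Fitted values and residuals for every observation in the training data
augment(fit_sweden)

# Summary accuracy measures (such as RMSE and MAE) computed on the training data
accuracy(fit_sweden)
```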

Produce forecasts (forecast)

With an appropriate model specified, estimated and checked, it is time to produce the forecasts using forecast(). The easiest way to use this function is by specifying the number of future observations to forecast. For example, forecasts for the next 10 observations can be generated using h = 10. We can also use natural language; e.g., h = "2 years" can be used to predict two years into the future.

In other situations, it may be more convenient to provide a dataset of future time periods to forecast. This is commonly required when your model uses additional information from the data, such as variables for exogenous regressors. Additional data required by the model can be included in the dataset of observations to forecast.
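
A minimal sketch of this second approach follows. It assumes tsibble's new_data() helper, which generates the next n time points for each series; the names sweden, fit_sweden and future are introduced for illustration. A linear trend model needs no extra regressors, so here the dataset of future periods contains only the key and the index.

```r
library(fpp3)

sweden <- global_economy %>%
  filter(Country == "Sweden")

fit_sweden <- sweden %>%
  model(trend_model = TSLM(GDP ~ trend()))

# Build a tsibble holding the next three years for each key
future <- new_data(sweden, n = 3)

# Forecast the periods listed in `future` instead of supplying h
fit_sweden %>%
  forecast(new_data = future)
```

For a model with exogenous regressors, the future values of those regressors would be added as columns of future before forecasting.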

fit %>% forecast(h = "3 years")
#> # A fable: 789 x 5 [1Y]
#> # Key:     Country, .model [263]
#>    Country        .model       Year           GDP .distribution      
#>    <fct>          <chr>       <dbl>         <dbl> <dist>             
#>  1 Afghanistan    trend_model  2018  16205101654. N(1.6e+10, 1.3e+19)
#>  2 Afghanistan    trend_model  2019  16511878141. N(1.7e+10, 1.3e+19)
#>  3 Afghanistan    trend_model  2020  16818654627. N(1.7e+10, 1.3e+19)
#>  4 Albania        trend_model  2018  13733734164. N(1.4e+10, 3.9e+18)
#>  5 Albania        trend_model  2019  14166852711. N(1.4e+10, 3.9e+18)
#>  6 Albania        trend_model  2020  14599971258. N(1.5e+10, 3.9e+18)
#>  7 Algeria        trend_model  2018 157895153441. N(1.6e+11, 9.4e+20)
#>  8 Algeria        trend_model  2019 161100952126. N(1.6e+11, 9.4e+20)
#>  9 Algeria        trend_model  2020 164306750811. N(1.6e+11, 9.4e+20)
#> 10 American Samoa trend_model  2018    682475000  N(6.8e+08, 1.7e+15)
#> # … with 779 more rows

This is a forecasting table, or “fable”. Each row corresponds to one forecast period for each country. The GDP column contains the point forecast, while the .distribution column contains the forecasting distribution. The point forecast is the mean (or average) of the forecasting distribution.
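
Since each forecast is stored as a full distribution, prediction intervals can be derived from the fable. One way to do this, sketched below, is the hilo() function from fabletools; the 80% and 95% levels are simply common choices, and the name fc_sweden is hypothetical.

```r
library(fpp3)

fc_sweden <- global_economy %>%
  filter(Country == "Sweden") %>%
  model(trend_model = TSLM(GDP ~ trend())) %>%
  forecast(h = "3 years")

# Add one column of prediction intervals per requested coverage level
fc_sweden %>%
  hilo(level = c(80, 95))
```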

The forecasts can be plotted along with the historical data using autoplot() as follows.

# Plot the forecasts on top of the historical data (GDP is measured in $US)
fit %>% forecast(h = "3 years") %>%
  filter(Country == "Sweden") %>%
  autoplot(global_economy) +
    ggtitle("GDP for Sweden") + ylab("$US")