5.1 A tidy forecasting workflow
The process of producing forecasts for time series data can be broken down into a few steps.
To illustrate the process, we will fit linear trend models to national GDP data stored in global_economy.
Data preparation (tidy)
The first step in forecasting is to prepare data in the correct format. This process may involve loading in data, identifying missing values, filtering the time series and other pre-processing tasks. The functionality provided by tsibble and other packages in the tidyverse substantially simplifies this step.
Different models have different data requirements: some require the series to be in time order, while others require that there are no missing values. Checking your data is an essential step in understanding its features, and it is best done before any models are estimated.
The way in which your data is prepared can also be used to explore different features of the time series. As we will see later, pre-processing your dataset is an important step in evaluating model performance using cross-validation.
For our example, the data preparation has already been done in creating the global_economy tsibble.
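The checks described above can be sketched as follows. This is a minimal illustration, assuming the tsibble, tsibbledata and dplyr packages are installed; the object names gap_check and missing_gdp are hypothetical and do not appear in the original text.

```r
library(dplyr)
library(tsibble)
library(tsibbledata)  # assumed source of the global_economy tsibble

# Does any country have implicit gaps (missing years) in its time index?
gap_check <- global_economy %>% has_gaps()

# How many GDP observations are explicitly missing (NA) for each country?
missing_gdp <- global_economy %>%
  as_tibble() %>%
  group_by(Country) %>%
  summarise(n_missing = sum(is.na(GDP)))
```

Inspecting gap_check and missing_gdp before modelling shows which series might need filtering or other treatment for a model that cannot handle missing values.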
Plot the data (visualise)
As we have seen in Chapter 2, visualisation is an essential step in understanding the data. Looking at your data allows you to identify common patterns, and subsequently specify an appropriate model.
The data for one country in our example are plotted in Figure 5.1.
global_economy %>%
  filter(Country == "Sweden") %>%
  autoplot(GDP) +
  ggtitle("GDP for Sweden") + ylab("$US billions")
Define a model (specify)
Before fitting a model to the data, we first must describe the model. There are many different time series models that can be used for forecasting, and much of this book is dedicated to describing various models. Specifying an appropriate model for the data is essential for producing appropriate forecasts.
Models in R are specified using model functions, which each use a formula (y ~ x) interface. The response variable(s) are specified on the left of the formula, and the structure of the model is written on the right.
For example, a linear model (to be discussed in Chapter 7) that models GDP using a linear trend can be specified with TSLM(GDP ~ trend()).
In this case the model function is TSLM() (time series linear model), the response variable is GDP, and it is being modelled using trend() (a “special” function specifying a linear trend). We will take a closer look at how each model can be specified in its respective section.
The special functions used to define the model’s structure vary between models (as each model can support different structures). The “Specials” section of the documentation for each model function lists these special functions and how they can be used.
The left side of the formula also supports the transformations discussed in Section 3.1, which can be useful in simplifying the time series patterns or constraining the forecasts to be between specific values.
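As a hedged illustration of a transformation on the left side of the formula, the sketch below models log(GDP) for a single country. The choice of a log transformation and the names sweden, fit_log and fc_log are assumptions made for this example only.

```r
library(dplyr)
library(tsibble)
library(tsibbledata)  # assumed source of the global_economy tsibble
library(fable)

# Keep one country so the example runs quickly
sweden <- global_economy %>% filter(Country == "Sweden")

# Transformation on the left of the formula: log(GDP) with a linear trend
fit_log <- sweden %>%
  model(log_trend = TSLM(log(GDP) ~ trend()))

# Forecasts are returned on the original GDP scale
fc_log <- fit_log %>% forecast(h = 5)
```

Because the model is fitted on the log scale, the back-transformed forecasts are constrained to be positive.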
Train the model (estimate)
Once an appropriate model is specified, we next train the model on some data. One or more model specifications can be estimated using the model() function.
To estimate the model in our example, we use
fit <- global_economy %>%
  model(trend_model = TSLM(GDP ~ trend()))
This fits a linear trend model to the GDP data for each combination of key variables in the tsibble. In this example, it will fit a model to each of the 263 countries in the dataset. The resulting object is a model table or a “mable”.
fit
#> # A mable: 263 x 2
#> # Key: Country
#>    Country             trend_model
#>    <fct>               <model>
#>  1 Afghanistan         <TSLM>
#>  2 Albania             <TSLM>
#>  3 Algeria             <TSLM>
#>  4 American Samoa      <TSLM>
#>  5 Andorra             <TSLM>
#>  6 Angola              <TSLM>
#>  7 Antigua and Barbuda <TSLM>
#>  8 Arab World          <TSLM>
#>  9 Argentina           <TSLM>
#> 10 Armenia             <TSLM>
#> # … with 253 more rows
Each row corresponds to one combination of the key variables. The trend_model column contains information about the fitted model for each country. In later chapters we will learn how to see more information about each model.
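Since more than one model specification can be estimated at once, each named specification becomes its own column in the mable. The sketch below is an illustration under assumptions: it restricts the data to one country for speed, and the second specification, a random walk with drift fitted with RW(), is not part of the original example.

```r
library(dplyr)
library(tsibble)
library(tsibbledata)  # assumed source of the global_economy tsibble
library(fable)

sweden <- global_economy %>% filter(Country == "Sweden")

# Two specifications estimated in a single call to model()
fits <- sweden %>%
  model(
    trend_model = TSLM(GDP ~ trend()),
    drift_model = RW(GDP ~ drift())  # assumed comparison model, for illustration
  )
```

The resulting mable has one row for Sweden and one column per model, which makes it convenient to compare or forecast from several models together.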
Check model performance (evaluate)
Once a model has been fitted, it is important to check how well it has performed on the data. There are several diagnostic tools available to check model behaviour, and also accuracy measures that allow one model to be compared against another. Section 5.8 goes into further details.
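As a rough sketch of what such checks can look like, the code below computes training-set accuracy measures with accuracy() and extracts fitted values and residuals with augment(). It assumes a single-country fit for brevity; the names fit_sweden, train_acc and fitted_resid are hypothetical.

```r
library(dplyr)
library(tsibble)
library(tsibbledata)  # assumed source of the global_economy tsibble
library(fable)

sweden <- global_economy %>% filter(Country == "Sweden")
fit_sweden <- sweden %>%
  model(trend_model = TSLM(GDP ~ trend()))

# Accuracy measures (such as RMSE and MAE) computed on the training data
train_acc <- fit_sweden %>% accuracy()

# Fitted values and residuals, one row per observation
fitted_resid <- fit_sweden %>% augment()
```

The residuals in fitted_resid can then be plotted or summarised to look for patterns the model has failed to capture.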
Produce forecasts (forecast)
With an appropriate model specified, estimated and checked, it is time to produce the forecasts using forecast(). The easiest way to use this function is by specifying the number of future observations to forecast. For example, forecasts for the next 10 observations can be generated using h = 10. We can also use natural language; e.g., h = "2 years" can be used to predict two years into the future.
In other situations, it may be more convenient to provide a dataset of future time periods to forecast. This is commonly required when your model uses additional information from the data, such as variables for exogenous regressors. Additional data required by the model can be included in the dataset of observations to forecast.
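A minimal sketch of this second approach is given below, assuming a single-country fit and using new_data() from tsibble to build the future time periods. The names fit_sweden, future and fc_future are hypothetical. The trend model needs no extra regressors, so the future dataset contains only the index and key.

```r
library(dplyr)
library(tsibble)
library(tsibbledata)  # assumed source of the global_economy tsibble
library(fable)

sweden <- global_economy %>% filter(Country == "Sweden")
fit_sweden <- sweden %>%
  model(trend_model = TSLM(GDP ~ trend()))

# A tsibble holding the three years that follow the end of the data
future <- new_data(sweden, n = 3)

# Forecast for exactly the periods in `future`
fc_future <- fit_sweden %>% forecast(new_data = future)
```

For a model with exogenous regressors, the future values of those regressors would be added as columns of the future dataset before calling forecast().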
fit %>% forecast(h = "3 years")
#> # A fable: 789 x 5 [1Y]
#> # Key: Country, .model
#>    Country        .model       Year GDP                  .mean
#>    <fct>          <chr>       <dbl> <dist>               <dbl>
#>  1 Afghanistan    trend_model  2018 N(1.6e+10, 1.3e+19)  16205101654.
#>  2 Afghanistan    trend_model  2019 N(1.7e+10, 1.3e+19)  16511878141.
#>  3 Afghanistan    trend_model  2020 N(1.7e+10, 1.3e+19)  16818654627.
#>  4 Albania        trend_model  2018 N(1.4e+10, 3.9e+18)  13733734164.
#>  5 Albania        trend_model  2019 N(1.4e+10, 3.9e+18)  14166852711.
#>  6 Albania        trend_model  2020 N(1.5e+10, 3.9e+18)  14599971258.
#>  7 Algeria        trend_model  2018 N(1.6e+11, 9.4e+20) 157895153441.
#>  8 Algeria        trend_model  2019 N(1.6e+11, 9.4e+20) 161100952126.
#>  9 Algeria        trend_model  2020 N(1.6e+11, 9.4e+20) 164306750811.
#> 10 American Samoa trend_model  2018 N(6.8e+08, 1.7e+15)    682475000
#> # … with 779 more rows
This is a forecasting table, or “fable”. Each row corresponds to one forecast period for each country. The GDP column contains the forecasting distribution, while the .mean column contains the point forecast. The point forecast is the mean (or average) of the forecasting distribution.
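Because the GDP column holds a full distribution rather than a single number, other summaries can be derived from it. As one hedged example, the sketch below uses hilo() to compute 95% prediction intervals; the single-country fit and the names fit_sweden, fc_sweden and intervals are assumptions made for this illustration.

```r
library(dplyr)
library(tsibble)
library(tsibbledata)  # assumed source of the global_economy tsibble
library(fable)

sweden <- global_economy %>% filter(Country == "Sweden")
fit_sweden <- sweden %>%
  model(trend_model = TSLM(GDP ~ trend()))

fc_sweden <- fit_sweden %>% forecast(h = "3 years")

# Add a column named "95%" containing the 95% prediction interval
intervals <- fc_sweden %>% hilo(level = 95)
```

The point forecasts remain available in .mean, alongside the interval column.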
The forecasts can be plotted along with the historical data using autoplot() as follows.
fit %>%
  forecast(h = "3 years") %>%
  filter(Country == "Sweden") %>%
  autoplot(global_economy) +
  ggtitle("GDP for Sweden") + ylab("$US billions")