5.1 A tidy forecasting workflow
The process of producing forecasts for time series data can be broken down into a few steps.
To illustrate the process, we will fit linear trend models to national GDP data stored in global_economy.
Data preparation (tidy)
The first step in forecasting is to prepare data in the correct format. This process may involve loading in data, identifying missing values, filtering the time series, and other pre-processing tasks. The functionality provided by tsibble and other packages in the tidyverse substantially simplifies this step.
Many models have different data requirements; some require the series to be in time order, others require no missing values. Checking your data is an essential step to understanding its features and should always be done before models are estimated.
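The checks described above can be sketched as follows. This is a minimal illustration rather than a complete data audit; it assumes the fpp3 package is installed, and the object names gaps and missing_gdp are chosen here for illustration only.

```r
library(fpp3)  # attaches tsibble, fable, feasts, dplyr, ggplot2 and tsibbledata

# Implicit gaps: does any country skip a year in its time index?
gaps <- global_economy %>%
  has_gaps()          # one row per key, with a logical .gaps column
gaps %>% filter(.gaps)

# Explicit missing values: how many NA entries does GDP have per country?
missing_gdp <- global_economy %>%
  as_tibble() %>%
  group_by(Country) %>%
  summarise(n_missing = sum(is.na(GDP))) %>%
  filter(n_missing > 0)
missing_gdp
```

Implicit gaps (missing rows) and explicit missing values (NA entries) are checked separately because a model may tolerate one but not the other.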
We will model GDP per capita over time; so first, we must compute the relevant variable.
gdppc <- global_economy %>%
  mutate(GDP_per_capita = GDP / Population)
Plot the data (visualise)
As we have seen in Chapter 2, visualisation is an essential step in understanding the data. Looking at your data allows you to identify common patterns, and subsequently specify an appropriate model.
The data for one country in our example are plotted in Figure 5.1.
gdppc %>%
  filter(Country == "Sweden") %>%
  autoplot(GDP_per_capita) +
  labs(y = "$US", title = "GDP per capita for Sweden")
Define a model (specify)
There are many different time series models that can be used for forecasting, and much of this book is dedicated to describing various models. Specifying an appropriate model for the data is essential for producing appropriate forecasts.
Models in fable are specified using model functions, each of which uses a formula (y ~ x) interface. The response variable(s) are specified on the left of the formula, and the structure of the model is written on the right.
For example, a linear trend model (to be discussed in Chapter 7) for GDP per capita can be specified with
TSLM(GDP_per_capita ~ trend()).
In this case the model function is TSLM() (time series linear model), the response variable is GDP_per_capita, and it is being modelled using trend() (a “special” function specifying a linear trend when it is used within TSLM()). We will take a closer look at how each model can be specified in its respective section.
The special functions used to define the model’s structure vary between models (as each model can support different structures). The “Specials” section of the documentation for each model function lists these special functions and how they can be used.
The left side of the formula also supports the transformations discussed in Section 3.1, which can be useful in simplifying the time series patterns or constraining the forecasts to be between specific values (see Section 13.3).
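As a small sketch of a transformation on the left side of the formula, the following fits a linear trend to the logarithm of GDP per capita for a single country. The names log_fit, log_trend and fc are illustrative assumptions, not names used elsewhere in this section, and the log transformation is chosen only as an example.

```r
library(fpp3)

# Recreate the GDP per capita variable used in this section
gdppc <- global_economy %>%
  mutate(GDP_per_capita = GDP / Population)

# Transform the response directly inside the model formula
log_fit <- gdppc %>%
  filter(Country == "Sweden") %>%
  model(log_trend = TSLM(log(GDP_per_capita) ~ trend()))

# Forecasts are reported on the original scale, not the log scale
fc <- log_fit %>% forecast(h = 3)
fc
```

Because the transformation is written in the formula, fable back-transforms the forecasts automatically, so no manual exponentiation is needed.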
Train the model (estimate)
Once an appropriate model is specified, we next train the model on some data. One or more model specifications can be estimated using the model() function.
To estimate the model in our example, we use
fit <- gdppc %>%
  model(trend_model = TSLM(GDP_per_capita ~ trend()))
This fits a linear trend model to the GDP per capita data for each combination of key variables in the tsibble. In this example, it will fit a model to each of the 263 countries in the dataset. The resulting object is a model table or a “mable.”
fit
#> # A mable: 263 x 2
#> # Key:     Country
#>    Country             trend_model
#>    <fct>                   <model>
#>  1 Afghanistan              <TSLM>
#>  2 Albania                  <TSLM>
#>  3 Algeria                  <TSLM>
#>  4 American Samoa           <TSLM>
#>  5 Andorra                  <TSLM>
#>  6 Angola                   <TSLM>
#>  7 Antigua and Barbuda      <TSLM>
#>  8 Arab World               <TSLM>
#>  9 Argentina                <TSLM>
#> 10 Armenia                  <TSLM>
#> # … with 253 more rows
Each row corresponds to one combination of the key variables. The trend_model column contains information about the fitted model for each country. In later chapters we will learn how to see more information about each model.
Check model performance (evaluate)
Once a model has been fitted, it is important to check how well it has performed on the data. There are several diagnostic tools available to check model behaviour, as well as accuracy measures that allow one model to be compared against another. Sections 5.8 and 5.9 go into further detail.
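As a brief, hedged preview of this step, the sketch below extracts residuals and training-set accuracy measures for one fitted model. It is restricted to Sweden to keep it quick, and the names sweden_fit, aug and acc are illustrative assumptions.

```r
library(fpp3)

gdppc <- global_economy %>%
  mutate(GDP_per_capita = GDP / Population)

# Fit the linear trend model to a single country
sweden_fit <- gdppc %>%
  filter(Country == "Sweden") %>%
  model(trend_model = TSLM(GDP_per_capita ~ trend()))

# Fitted values and residuals, one row per observation
aug <- augment(sweden_fit)

# Accuracy measures (RMSE, MAE, ...) computed on the training data
acc <- accuracy(sweden_fit)
acc
```

Training-set accuracy only describes how well the model fits the data it was estimated on; it is not a substitute for evaluating forecasts on data held out from estimation.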
Produce forecasts (forecast)
With an appropriate model specified, estimated and checked, it is time to produce the forecasts using forecast(). The easiest way to use this function is by specifying the number of future observations to forecast. For example, forecasts for the next 10 observations can be generated using h = 10. We can also use natural language; e.g., h = "2 years" can be used to predict two years into the future.
In other situations, it may be more convenient to provide a dataset of future time periods to forecast. This is commonly required when your model uses additional information from the data, such as exogenous regressors. Additional data required by the model can be included in the dataset of observations to forecast.
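A minimal sketch of forecasting from a dataset of future periods is given below. It uses new_data() from tsibble to generate the future rows; the two-country subset and the names small_fit, future_years and fc are assumptions made here to keep the example short.

```r
library(fpp3)

gdppc <- global_economy %>%
  mutate(GDP_per_capita = GDP / Population) %>%
  filter(Country %in% c("Sweden", "Australia"))

small_fit <- gdppc %>%
  model(trend_model = TSLM(GDP_per_capita ~ trend()))

# A tsibble holding the next 3 time periods for each country
future_years <- new_data(gdppc, n = 3)

# Forecast for exactly the periods listed in future_years
fc <- small_fit %>% forecast(new_data = future_years)
fc
```

The trend() special depends only on the time index, so no extra columns are needed here. For a model with exogenous regressors, the future values of those regressors would be added as columns of future_years before calling forecast().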
fit %>% forecast(h = "3 years")
#> # A fable: 789 x 5 [1Y]
#> # Key:     Country, .model
#>    Country        .model       Year   GDP_per_capita  .mean
#>    <fct>          <chr>       <dbl>           <dist>  <dbl>
#>  1 Afghanistan    trend_model  2018     N(526, 9653)   526.
#>  2 Afghanistan    trend_model  2019     N(534, 9689)   534.
#>  3 Afghanistan    trend_model  2020     N(542, 9727)   542.
#>  4 Albania        trend_model  2018  N(4716, 476419)  4716.
#>  5 Albania        trend_model  2019  N(4867, 481086)  4867.
#>  6 Albania        trend_model  2020  N(5018, 486012)  5018.
#>  7 Algeria        trend_model  2018  N(4410, 643094)  4410.
#>  8 Algeria        trend_model  2019  N(4489, 645311)  4489.
#>  9 Algeria        trend_model  2020  N(4568, 647602)  4568.
#> 10 American Samoa trend_model  2018 N(12491, 652926) 12491.
#> # … with 779 more rows
This is a forecast table, or “fable.” Each row corresponds to one forecast period for each country. The GDP_per_capita column contains the forecast distribution, while the .mean column contains the point forecast. The point forecast is the mean (or average) of the forecast distribution.
The forecasts can be plotted along with the historical data using autoplot() as follows.
fit %>%
  forecast(h = "3 years") %>%
  filter(Country == "Sweden") %>%
  autoplot(gdppc) +
  labs(y = "$US", title = "GDP per capita for Sweden")