5/5 Matrix formulation

Warning: this is a more advanced optional section and assumes knowledge of matrix algebra.

The multiple regression model can be written as

    \[ y_{i} = \beta_{0} + \beta_{1} x_{1,i} + \beta_{2} x_{2,i} + \cdots + \beta_{k} x_{k,i} + e_{i}. \]

This expresses the relationship between a single value of the forecast variable and the predictors. It can be convenient to write this in matrix form, where all the values of the forecast variable are given in a single equation. Let \bm{Y} = (y_{1},\dots,y_{n})', \bm{e} = (e_{1},\dots,e_{n})', \bm{\beta} = (\beta_{0},\beta_{1},\dots,\beta_{k})' and

    \[ \bm{X} = \left[\begin{matrix} 1 & x_{1,1} & x_{2,1} & \dots & x_{k,1}\\ 1 & x_{1,2} & x_{2,2} & \dots & x_{k,2}\\ \vdots & \vdots & \vdots & & \vdots\\ 1 & x_{1,n} & x_{2,n} & \dots & x_{k,n} \end{matrix}\right]. \]

Then

    \[ \bm{Y} = \bm{X}\bm{\beta} + \bm{e}. \]
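As a small illustration of how the matrix form stacks the n scalar equations, the following sketch builds a design matrix with NumPy. The data, the coefficient values and the error values are all made up for the example; only the layout of the matrix (a column of ones followed by one column per predictor) comes from the text.

```python
import numpy as np

# Hypothetical toy data: n = 5 observations of k = 2 predictors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Design matrix X: a leading column of ones (for the intercept beta_0),
# followed by one column per predictor. Its shape is (n, k + 1).
X = np.column_stack([np.ones_like(x1), x1, x2])

# Assumed coefficients (beta_0, beta_1, beta_2) and errors, for illustration only.
beta = np.array([1.0, 2.0, -0.5])
e = np.array([0.1, -0.2, 0.0, 0.3, -0.1])

# All n equations at once: Y = X beta + e.
Y = X @ beta + e

# Any single row reproduces the scalar form of the model.
i = 2
y_i = beta[0] + beta[1] * x1[i] + beta[2] * x2[i] + e[i]
print(np.isclose(Y[i], y_i))
```

Note that the coefficient vector has k + 1 elements because the first column of the design matrix carries the intercept.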

Least squares estimation

The least squares estimate of \bm{\beta} is obtained by minimizing the sum of squared errors (\bm{Y} - \bm{X}\bm{\beta})'(\bm{Y} - \bm{X}\bm{\beta}). It can be shown that this is minimized when \bm{\beta} takes the value

    \[ \hat{\bm{\beta}} = (\bm{X}'\bm{X})^{-1}\bm{X}'\bm{Y}. \]

This is sometimes known as the “normal equation”. Computing the estimated coefficients requires inverting the matrix \bm{X}'\bm{X}. If this matrix is singular, then the model cannot be estimated. This will occur, for example, if you fall for the “dummy variable trap” (having the same number of dummy variables as there are categories of a categorical predictor): the dummy columns then sum to the column of ones in \bm{X}, so the columns of \bm{X} are linearly dependent.
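The estimate and the singularity problem can both be checked numerically. The sketch below uses simulated data (the sample size, coefficients and noise level are assumptions made for illustration) and solves the linear system with `np.linalg.solve` instead of forming the inverse explicitly, which is a common numerical choice and not something the text prescribes. The second half constructs a design matrix that falls into the dummy variable trap and inspects its rank.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical simulated data: n = 50 observations, k = 2 predictors.
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Solve (X'X) beta = X'Y; equivalent to beta_hat = (X'X)^{-1} X'Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's least squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))

# Dummy variable trap: an intercept plus one dummy per category.
category = np.arange(n) % 3  # three categories, labelled 0, 1, 2
dummies = (category[:, None] == np.arange(3)).astype(float)
X_trap = np.column_stack([np.ones(n), dummies])

# The three dummy columns sum to the column of ones, so X_trap has
# 4 columns but only rank 3, and X_trap' X_trap is singular.
rank_trap = np.linalg.matrix_rank(X_trap)

# Dropping one dummy restores full column rank (3 columns, rank 3).
rank_ok = np.linalg.matrix_rank(X_trap[:, :-1])
print(rank_trap, rank_ok)
```

In practice the usual remedy is the one shown in the last step: use one fewer dummy variable than there are categories.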

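The residual variance is estimated by dividing the sum of squared residuals by n minus the number of estimated coefficients (the k slopes plus the intercept):

    \[ \hat{\sigma}^2 = \frac{1}{n-k-1}(\bm{Y} - \bm{X}\hat{\bm{\beta}})'(\bm{Y} - \bm{X}\hat{\bm{\beta}}). \]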
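A minimal sketch of the variance estimate follows, again on simulated data with an assumed error standard deviation of 0.5. The divisor is taken as n minus the number of columns of the design matrix, which counts the intercept column along with the k predictors.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical simulated data with known error standard deviation 0.5.
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
residuals = Y - X @ beta_hat

# Degrees of freedom: n minus the number of columns of X
# (k predictors plus the intercept column).
dof = n - X.shape[1]
sigma2_hat = (residuals @ residuals) / dof
print(dof, sigma2_hat)
```

With these settings the estimate should land near the assumed variance of 0.25, although the exact value depends on the simulated sample.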

Fitted values and cross-validation

The normal equation shows that the fitted values can be calculated using

    \[ \bm{\hat{Y}} = \bm{X}\hat{\bm{\beta}} = \bm{X}(\bm{X}'\bm{X})^{-1}\bm{X}'\bm{Y} = \bm{H}\bm{Y}, \]

where \bm{H} = \bm{X}(\bm{X}'\bm{X})^{-1}\bm{X}' is known as the “hat-matrix” because it is used to compute \bm{\hat{Y}} (“Y-hat”).
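The hat-matrix is easy to form for small problems. The sketch below, on simulated data chosen only for illustration, checks that it reproduces the fitted values and also verifies three standard properties that are not stated in the text: the matrix is symmetric, it is idempotent, and its trace equals the number of columns of the design matrix.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical simulated data: n = 30 observations, k = 2 predictors.
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# H = X (X'X)^{-1} X', computed with a linear solve rather than an explicit inverse.
H = X @ np.linalg.solve(X.T @ X, X.T)

# Fitted values from H agree with X beta_hat.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = H @ Y
print(np.allclose(Y_hat, X @ beta_hat))

# Projection-matrix properties: symmetric, idempotent, trace = k + 1.
h = np.diag(H)  # the diagonal values h_1, ..., h_n
print(np.allclose(H, H.T), np.allclose(H @ H, H), np.isclose(h.sum(), k + 1))
```

Forming the full n by n matrix is only sensible when n is modest; for large n one would normally compute just the diagonal.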

If the diagonal values of \bm{H} are denoted by h_{1},\dots,h_{n}, then the cross-validation statistic can be computed using

    \[ \text{CV} = \frac{1}{n}\sum_{i=1}^n [e_{i}/(1-h_{i})]^2, \]

where e_{i} is the residual obtained from fitting the model to all n observations. Thus, it is not necessary to actually fit n separate models when computing the CV statistic.
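The claim that n separate fits are unnecessary can be checked directly. The following sketch, on simulated data with assumed coefficients, computes the CV statistic once with the shortcut formula and once by brute-force leave-one-out refitting, then compares the two.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical simulated data: n = 40 observations, k = 2 predictors.
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# Shortcut: a single fit on all n observations.
H = X @ np.linalg.solve(X.T @ X, X.T)
e = Y - H @ Y          # ordinary residuals e_i
h = np.diag(H)         # diagonal values h_i
cv_shortcut = np.mean((e / (1.0 - h)) ** 2)

# Brute force: refit n times, leaving out one observation each time,
# and record the error in predicting the omitted observation.
errors = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ Y[keep])
    errors[i] = Y[i] - X[i] @ beta_i
cv_brute = np.mean(errors ** 2)

print(np.isclose(cv_shortcut, cv_brute))
```

The two values agree up to floating-point error, because each leave-one-out prediction error equals the ordinary residual divided by 1 - h_i.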

Forecasts

Let \bm{X}^* be a row vector containing the values of the predictors for the forecast (in the same format as a row of \bm{X}, including the leading 1). Then the forecast is given by

    \[ \hat{y} = \bm{X}^*\hat{\bm{\beta}} = \bm{X}^*(\bm{X}'\bm{X})^{-1}\bm{X}'\bm{Y} \]

and its variance by

    \[ \sigma^2 \left[1 + \bm{X}^* (\bm{X}'\bm{X})^{-1} (\bm{X}^*)'\right]. \]

Then a 95% prediction interval can be calculated (assuming normally distributed errors) as

    \[ \hat{y} \pm 1.96 \hat{\sigma} \sqrt{1 + \bm{X}^* (\bm{X}'\bm{X})^{-1} (\bm{X}^*)'}. \]

This takes account of the uncertainty due to the error term e and the uncertainty in the coefficient estimates. However, it ignores any errors in \bm{X}^*. So if the future values of the predictors are uncertain, then the prediction interval calculated using this expression will be too narrow.
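The forecast and its prediction interval can be sketched as follows. The data are simulated and the future predictor values in `x_star` are hypothetical; they are treated as known exactly, which is the assumption the interval relies on.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical simulated data: n = 100 observations, k = 2 predictors.
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ Y)
residuals = Y - X @ beta_hat
# Residual standard deviation, dividing by n minus the number of columns of X.
sigma_hat = np.sqrt(residuals @ residuals / (n - X.shape[1]))

# Hypothetical future predictor values, laid out like a row of X:
# a leading 1 for the intercept, then x_1 and x_2.
x_star = np.array([1.0, 0.5, -1.0])

# Point forecast.
y_hat = x_star @ beta_hat

# Quadratic form X* (X'X)^{-1} (X*)'. In the interval, the "1" reflects the
# error term and this quadratic form reflects coefficient uncertainty.
q = x_star @ np.linalg.solve(XtX, x_star)
half_width = 1.96 * sigma_hat * np.sqrt(1.0 + q)
lower, upper = y_hat - half_width, y_hat + half_width
print(lower, y_hat, upper)
```

The quadratic form grows as the future predictor values move away from the observed data, so the interval widens for forecasts made far from the range of the predictors used in fitting.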


