Anomalies are observations that are unlikely to have come from the same distribution as the rest of the data. So one approach to identifying anomalies is to first estimate the probability distribution of the data, and then identify observations that are unlikely to have come from that distribution. This is the approach taken by density-based methods, where anomalies are defined as observations of low probability.

6.1 Density scores and the lookout algorithm

Any density-based method of anomaly detection first requires that we have a density estimate at each observation, which we will denote by f(\bm{y}_i). This may be an assumed density, or estimated from the data; it may be a parametric function, or a nonparametric estimate; it may be a conditional density or a marginal density. Wherever it comes from, we will assume that f represents a probability density function that is appropriate for the data set. When \bm{y}_i takes values on a discrete space (such as the integers), then f is a probability mass function, but we will call it a “density” for convenience when discussing methods that can apply to both discrete and continuous data.

The density score of an observation \bm{y}_i, is defined as s_i = -\log f(\bm{y}_i). So it is a measure of how anomalous that observation is, given the density f. A large value of s_i indicates that \bm{y}_i is not a likely value, and so is a potential anomaly. On the other hand, typical values will have low density scores.

Density scores are commonly used in forecasting and prediction problems, where they are used to assess whether an estimated distribution f provides a good description of the future values y(Gneiting and Katzfuss 2014). In that context, the data y are assumed to come from some distribution, and the density scores are used to assess how well the estimated distribution f matches that distribution. Here we are using density scores in reverse — we assume f is a good description of the data, and then we use the density scores to identify observations that are unlikely to have come from that distribution.

If the density f has been estimated from the data, then it is sometimes useful to consider the leave-one-out (LOO) estimate given by f_{-i}. That is, f_{-i} is the density estimate using all observations other than the ith observation. Then an unusual observation can’t influence the density estimate, giving us a better measure of how anomalous it is. Then we will call the associated density score, a “LOO density score”, given by s_i = -\log f_{-i}(\bm{y}_i).

The “lookout” algorithm (standing for Leave-One-Out Kernel density estimates for OUTlier detection) was proposed by Kandanaarachchi and Hyndman (2022) and uses density scores to find the probability of each observation being an anomaly. Although it was proposed using density scores from kernel density estimates, it can be applied to scores obtained using any density.

The underlying idea of the lookout algorithm is to use extreme value theory applied to the density scores to estimate the probability of each observation being an anomaly. We fit a Generalized Pareto Distribution to the density scores using the POT approach discussed in Section 2.8, with the 95^{\text{th}} percentile as the threshold. Then we apply the fitted distribution to the LOO density scores to obtain the probability of each observation. In some cases, leave-one-out scores are too difficult to compute, and then the fitted distribution is applied to the density scores. While this is not quite as accurate, it can still give a useful approximation, especially when there is a large number of observations.

In the following sections, we will discuss how to compute density scores for a variety of different density estimates, and then we will apply the lookout algorithm to those scores.

6.2 Linear regression

Suppose we want to find anomalies amongst n univariate observations y_1,\dots,y_n, and we have p variables that we think might be useful for predicting y. Then we can write the conditional density as f(y \mid \bm{x}), where \bm{x} is a p-dimensional vector of predictor variables. Anomalies in y are identified as observations that are unlikely to have come from the conditional density f. This is commonly called a “regression model”, regardless of the form of f, or whether the relationship with \bm{x} is linear or not.

By far the most common type of regression model assumes that f is a Normal distribution, and that the conditional mean is a linear function of \bm{x}. Note that this does not mean that y is Normally distributed, or that \bm{x} has a Normal distribution. The assumption is that the conditional distribution of y given \bm{x} is Normal, which can easily be checked by looking at the residuals from the regression model.

For a linear Normal regression model, with independent observations and homoscedastic errors, the conditional distribution is given by
y \mid \bm{x} \sim N(\bm{x}_+'\bm{\beta}, \sigma^2),
\tag{6.1} where \bm{x}_+ = [1, \bm{x}]' is a (p+1)-dimensional vector containing a 1 in the first position and the predictors in the remaining positions, and \bm{\beta} is a (p+1)-dimensional vector of regression coefficients.

Model estimation

The model can be written in matrix form as
\bm{y} \sim N(\bm{X}\bm{\beta}, \sigma^2\bm{I}),
where \bm{X} is an n\times(p+1) matrix with the first column being a vector of 1s, and the other columns containing the predictor variables, or equivalently as
\bm{\varepsilon} = \bm{y} - \bm{X}\bm{\beta} \sim N(\bm{0}, \sigma^2\bm{I}).
\tag{6.2} Provided \bm{X} is of rank p+1, and the errors \bm{\varepsilon} are independent of \bm{X}, the model can be estimated using ordinary least squares regression (Seber and Lee 2003), resulting in the estimate
\hat{\bm{\beta}} = (\bm{X}'\bm{X})^{-1}\bm{X}'\bm{y}.
The fitted values (i.e., predicted values for the training data) are given by
\hat{\bm{y}} = \bm{X}\hat{\bm{\beta}} = \bm{H}\bm{y},
where \bm{H} = \bm{X}(\bm{X}'\bm{X})^{-1}\bm{X}' is known as the “hat”-matrix because it creates the “y-hat” values \hat{\bm{y}} from the data \bm{y}.

The diagonals of \bm{H}, given by h_1,\dots,h_n, take values between 0 and 1. These are known as the “leverage” values (Faraway 2014, p69), and measure how much each observation influences the corresponding fitted value. High leverage values (close to 1) correspond to observations that have a large influence on the estimated coefficients, and so leaving those observations out will lead to very different values for the fitted values and residuals. On the other hand, small leverage values (close to 0) correspond to observations that have little influence on the estimated coefficients, and so leaving those observations out will lead to similar values for the fitted values and residuals.

Residuals

The residuals from the model are given by
\bm{e} = \bm{y} - \hat{\bm{y}} = (\bm{I} - \bm{H})\bm{y}.
\tag{6.3} Note that the residuals have the distribution \bm{e}\mid\bm{X} \sim N(\bm{0}, \sigma^2(\bm{I} - \bm{H})), which is not quite the same as the distribution of the errors given by Equation 6.2. However, the two distributions are asymptotically equivalent as n\rightarrow\infty. Often, we need standardized residuals, which are obtained by dividing each residual by its estimated standard deviation, giving r_i = \frac{e_i}{\hat{\sigma}\sqrt{1-h_i}}, \qquad i = 1,\dots, n,
where
\hat\sigma^2 = \frac{1}{n-p-1}\sum_{i=1}^n e_i^2
\tag{6.4} is the estimated residual variance.

A linear model can be estimated in R using the stats::lm() function. The broom::augment() function will compute the residuals (named .resid), the standardized residuals (named .std.resid), and the leverage values (names .hat).

The density scores under the Gaussian linear regression model Equation 6.1 can be estimated using these standardized residuals, giving
s_i = -\log\phi(r_i),
\tag{6.5} where \phi(u) = (2\pi)^{-1/2}e^{-u^2} is the standard normal density. This can be computed as follows, assuming that fit is the output from stats::lm().

Equivalently, the density_scores() function will compute them:

density_scores(fit)

LOO residuals

The leave-one-out residual for the ith observation is defined as the difference between \bm{y}_i and the predicted value obtained using a model fitted to all observations except the ith observation. At first, this appears to involve a lot of computation — estimating n separate models. However, the leave-one-out residuals are easily obtained from a linear regression model without actually having to re-estimate the model many times. It can be shown (Montgomery, Peck, and Vining 2012, Appendix C.7) that the leave-one-out (LOO) residuals are given by
e_{-i} = e_{i}/(1-h_{i}),
\tag{6.6} where e_{i} is the residual obtained from fitting the model to all observations. If we divide the LOO residuals by \hat\sigma from Equation 6.4, we obtain the “standardized” residuals.

In the context of anomaly detection, it often makes more sense to standardize each LOO residual by the standard deviation estimated from the model fitted to all other observations, rather than by \hat\sigma. If we leave out the ith observation, and fit a regression model to the remaining observations, then the estimated variance of the residuals is given by (Montgomery, Peck, and Vining 2012, Appendix C.8)
\hat\sigma_{-i}^2 = \frac{1}{n-p-2}\left[(n-p-1)\hat\sigma^2 - e_{i}^2/(1-h_i)\right].
\tag{6.7} These are computed by broom::augment() and the values of \hat\sigma_{-i} are returned in the column .sigma.

If we standardize each residual using \sigma_{-i}, we obtain the “studentized” residuals
r_{-i} = \frac{e_{i}}{\hat\sigma_{-i}\sqrt{1-h_i}},
\tag{6.8} from which we obtain the log-LOO regression scores given by
s_{-i} = -\log \phi(r_{-i})
\tag{6.9}

If fit is the output from stats::lm(), then these quantities can be computed as follows.

More simply, we can just use the density_scores() function again:

density_scores(fit, type ="loo")

Example: Shiraz reviews

For example, consider the wine reviews of Shiraz (aka Syrah), plotted in Figure 1.6. We can fit a linear regression model to these data to obtain a conditional density estimate of price given the points awarded to each wine. Then, \bm{X} contains just two columns: a column of 1s, and a column containing the points values. The vector \bm{y} contains the log prices of the wines. The model can be fitted as follows.

#>
#> Call:
#> lm(formula = log(price) ~ points, data = shiraz)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.6407 -0.3361 -0.0109 0.3052 2.9909
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -6.05913 0.20731 -29.2 <2e-16 ***
#> points 0.10690 0.00232 46.0 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.485 on 4494 degrees of freedom
#> Multiple R-squared: 0.32, Adjusted R-squared: 0.32
#> F-statistic: 2.12e+03 on 1 and 4494 DF, p-value: <2e-16

The fitted model can be written as
\log(\text{Price}) \sim N(-6.059 + 0.107 \times \text{Points}, 0.485^2),
and is depicted in Figure 6.1 with 95% prediction intervals.

Code

wine_aug <- broom::augment(fit_wine, data = shiraz, interval ="prediction") |>mutate(lwr =exp(.lower),upr =exp(.upper),location =case_when( price < lwr ~"below", price > upr ~"above",TRUE~"within" ) )wine_aug |>ggplot(aes(y = price, x = points, col = location)) +geom_jitter(height =0, width =0.1, alpha =0.5) +geom_ribbon(aes(ymin = lwr, ymax = upr), fill ="#cccccc", alpha =0.25) +geom_line(aes(y =exp(.fitted)), color ="#666666") +scale_y_log10() +guides(fill ="none", col ="none") +scale_color_manual(values =c("#0072B2", "#D55E00", "#AAAAAA"))

The LOO density scores obtained from this model are shown in Figure 6.2, using the same colors as Figure 6.1 to indicate whether the observation is below, within, or above, the 95% prediction interval.

The over-priced wines under this model are shown in blue, while the under-priced wines are shown in orange. This shows that the most anomalous observations are the two with LOO density scores above 17, and studentized residuals close to 6. The largest LOO density score is for the most over-priced wine (under this model), a 2009 Shiraz from the Henschke winery in the Eden Valley region of South Australia, with 91 points and a price of $780.

#> # A tibble: 1 × 9
#> country state region winery year points price Studentized_residuals
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Australia South Aust… Eden … Hensc… 2009 91 780 6.20
#> # ℹ 1 more variable: LOO_fscores <dbl>

The largest LOO density score corresponding to an under-priced wine is for the wine with the lowest residual value, with 85 points and a price of $4. Another good buy, at the higher quality end, is the 2007 Syrah from the Rulo winery in the Columbia Valley in Washington State, USA:

#> # A tibble: 1 × 9
#> country state region winery year points price Studentized_residuals
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 US Washington Columbia… Rulo 2007 96 20 -2.49
#> # ℹ 1 more variable: LOO_fscores <dbl>

This corresponds to the only orange point in Figure 6.1 that has a point value above 95 and a price below the 95% prediction interval.

Lookout probabilities

We will apply the lookout algorithm to these scores. First we compute the threshold value equal to the 95th percentile of the density scores, and then fit a generalized Pareto distribution to the scores above this threshold.

#> # A tibble: 4,496 × 9
#> country state region winery year points price LOO_fscores lookout_prob
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Australia South … Eden … Hensc… 2009 91 780 20.1 0.00566
#> 2 US Califo… Centr… Law 2013 92 750 18.3 0.00693
#> 3 Australia South … Eden … Hensc… 2010 96 820 14.4 0.0116
#> 4 Australia South … South… Penfo… 2008 98 850 12.5 0.0156
#> 5 Italy Tuscany Tosca… Tua R… 2011 89 300 11.7 0.0181
#> 6 Australia South … South… Penfo… 2010 99 850 11.5 0.0189
#> 7 Italy Tuscany Tosca… Tua R… 2013 90 300 10.7 0.0221
#> 8 Italy Tuscany Tosca… Tua R… 2012 90 300 10.7 0.0221
#> 9 France Rhône … Côte … Domai… 2013 93 391 10.2 0.0247
#> 10 Australia South … South… Penfo… 2004 96 500 9.57 0.0287
#> # ℹ 4,486 more rows

Those with the ten smallest lookout probabilities are all wines that appear to be over-priced given their points values.

The above code was introduced to illustrate each step of the procedure, but it is simpler to use the lookout function. This can be applied directly to the fitted model object, as follows.

#> # A tibble: 4,496 × 8
#> country state region winery year points price lookout_prob
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Australia South Australia Eden Vall… Hensc… 2009 91 780 0.00566
#> 2 US California Central C… Law 2013 92 750 0.00693
#> 3 Australia South Australia Eden Vall… Hensc… 2010 96 820 0.0116
#> 4 Australia South Australia South Aus… Penfo… 2008 98 850 0.0156
#> 5 Italy Tuscany Toscana Tua R… 2011 89 300 0.0181
#> 6 Australia South Australia South Aus… Penfo… 2010 99 850 0.0189
#> 7 Italy Tuscany Toscana Tua R… 2013 90 300 0.0221
#> 8 Italy Tuscany Toscana Tua R… 2012 90 300 0.0221
#> 9 France Rhône Valley Côte Rôtie Domai… 2013 93 391 0.0247
#> 10 Australia South Australia South Aus… Penfo… 2004 96 500 0.0287
#> # ℹ 4,486 more rows

Figure 6.3 shows the relationship between points and price, with the points colored according to the lookout probability. The six blue points are observations with lookout probabilities less than 0.02.

Code

shiraz |>ggplot(aes(x = points, y = price, color = lookout_prob <0.02)) +geom_jitter(height =0, width =0.2) +scale_y_log10()

6.3 GAM density scores

In some applications, it is not appropriate to assume the conditional density is Gaussian, or that the relationships are linear. One useful model that allows for non-Gaussian densities, and non-linear relationships, is a generalized additive model or GAM. Under this model, the conditional density is given by (Wood 2017)
y\mid\bm{x} \sim f(\mu), \qquad \ell(\mu) = \sum_{k=1}^p g_k(x_{k}),
where \mu = \text{E}(y | \bm{x}) denotes the conditional mean, \ell() is a link function, and each g_k function is smooth. If f is Normal, \ell is the identity, and g_i(u) = \beta_i u, then this reduces to the linear Gaussian model (Equation 6.1).

Consider the number of “not outs” for each batter in the cricket_batting data set. A “not out” occurs when a batsman has not been dismissed at the end of the team’s innings. Let’s consider if there are some batters who have an unusually high proportion of not outs. The data set contains results from 91022 innings, of which 11883 were not outs. So the overall proportion of not outs is 11883 / 91022 = 0.131.

Figure 6.4 shows the proportion of not outs for each batter as a function of the number of innings they played. There is some overplotting that occurs due to batters having the same numbers of not-outs and innings, which results in the higher color density of the corresponding plotted points. The unusual structure on the left of each plot is due to the discrete nature of the data. Batters who have played only a smaller number of innings tend to have a higher proportion of not outs on average, and a higher variance, than those who have played a large number of innings.

This suggests that we can construct a GAM for the number of not outs for each batter as a function of the number of innings they played. It is natural to use a Binomial distribution with a logit link function:
\text{NotOuts} \mid \text{Innings} \sim \text{Binomial}(n=\text{Innings},~ p),
where
\log(p / (1- p)) = g(\text{Innings})
We can fit this model using the mgcv package.

fit_notouts <- mgcv::gam(prop_no ~s(Innings), data = df_no,family =binomial(link = logit), weights = Innings)notouts_aug <- broom::augment(fit_notouts, data = df_no, type.predict ="response")notouts_aug |>ggplot(aes(x = Innings, y = prop_no)) +geom_point(alpha =0.2) +geom_line(aes(y = .fitted), color ="#0072B2") +geom_ribbon(aes(ymin = .fitted -2*.se.fit,ymax = .fitted +2*.se.fit),fill ="#0072B2", alpha =0.2) +labs(y ="Proportion of not outs")

Now we can use the fitted model to compute the density scores from the Binomial distribution, and find the most anomalous batters. Unfortunately, there is not a convenient way to compute loo density scores for GAM models, so we will only consider density scores in this example.

#> # A tibble: 3,702 × 8
#> Player Country Innings NotOuts prop_no .fitted fscores lookout
#> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 JM Anderson England 233 99 0.425 0.120 71.3 0.00869
#> 2 CS Martin New Zeala… 104 52 0.5 0.121 47.1 0.0185
#> 3 RGD Willis England 128 55 0.430 0.121 40.8 0.0240
#> 4 CA Walsh West Indi… 185 61 0.330 0.0965 40.7 0.0241
#> 5 TA Boult New Zeala… 86 42 0.488 0.121 37.1 0.0286
#> 6 M Muralitharan ICC/Sri L… 164 56 0.341 0.102 36.8 0.0290
#> 7 EJ Chatfield New Zeala… 54 33 0.611 0.131 36.0 0.0301
#> 8 BS Chandrasekhar India 80 39 0.488 0.123 34.1 0.0334
#> 9 GD McGrath Australia 138 51 0.370 0.118 31.8 0.0379
#> 10 M Ntini South Afr… 116 45 0.388 0.122 29.1 0.0447
#> # ℹ 3,692 more rows

The most anomalous batters are all “tail-enders” (i.e., not skilled batters) who played for a long time (so they have a large number of innings). Because they batted last, or nearly last, they are more likely to be not out at the end of the team’s innings.

The .fitted value is the expected proportion of not outs for each player given the number of innings they have played, while prop_no gives the actual proportion of not outs they have had. The largest density score is for English batter Jimmy Anderson, who has had 99 not outs in 233 innings, which is much higher than the expected number of not outs of 233 \times 0.120 = 27.9. This anomaly is also seen in Figure 6.5, as being somewhat unusual for that part of the data. Although Jimmy Anderson was not a great batter, he was good at defence, and was able to bat for a long time without being dismissed, leaving the other batter time to score runs.

We have identified an anomaly that is not anomalous in the proportion of not-outs, or in the number of innings, and the difference between the actual proportion and the predicted proportion is not anomalous either compared to some of the other values. However, because we have used a statistical model, we have been able to account for the particular features of this data set, such as the discrete nature of the data, and the changing variance, to identify an observation that is anomalous in the context of the model.

6.4 KDE scores

Suppose, instead of a regression or a GAM, we estimate f using a kernel density estimate. Then we call the resulting density scores “kde scores”. The kernel density estimate at each observation is (Equation 3.3)
f_i = \hat{f}(\bm{y}_i) = \frac{1}{n} \sum_{j=1}^n K_H(\bm{y}_i-\bm{y}_j),
\tag{6.10} and so the “kde score” at each observation as
s_i = -\log(f_i).
The largest possible score occurs when an observation has no other observations nearby. Then f_i \approx K_H(\bm{0})/n because K_H(\bm{y}_i-\bm{y}_j)\approx 0 when \|\bm{y}_i-\bm{y}_j\| is large. So the largest possible kde score, when using a Gaussian kernel, is
-\log(K_H(\bm{0})/n) \approx \log(n) + \frac{d}{2}\log(2\pi) + \frac{1}{2}\text{log det}(\bm{H}),
where \bm{H} is now the bandwidth matrix. For univariate data, when d=1, this simplifies to
-\log(K_h(0)/n) \approx \log(nh\sqrt{2\pi}).

Leave-one-out kde scores

The contribution of the ith point to the kernel density estimate at that point is K_H(\bm{0})/n. Therefore, we can compute leave-one-out kde scores as
f_{-i} = \left[nf_i - K_H(\bm{0})\right]/(n-1),
\tag{6.11} where f_i is the kde estimate at \bm{y}_i using all data. Thus, we can compute the leave-one-out kernel density scores without needing to re-estimate the density many times.

Example: Old Faithful eruption durations

For the Old Faithful eruption duration data, we obtain the following LOO kde scores.

of_scores <- oldfaithful |>mutate(score =density_scores(duration, loo =FALSE),loo_score =density_scores(duration, loo =TRUE),lookout =lookout(duration) )of_scores |>arrange(desc(loo_score))

The two infinite LOO scores correspond to the extreme 2 hour duration, and the tiny 1 second duration. These are so improbable given the rest of the data, that the scores are effectively infinite. The regular kde scores (with loo = FALSE) are at their maximum values.

Figure 6.6 shows an HDR boxplot of the data (other than the maximum), with those points identified as lookout anomalies highlighted in red. Notice that we can simply add the argument show_lookout = TRUE in order to highlight points with lookout probabilities less than 0.05.

#> # A tibble: 1 × 3
#> Player Average lookout
#> <chr> <dbl> <dbl>
#> 1 DG Bradman 99.9 0.000193

Here Bradman is a clear anomaly (with a very low lookout probability), and no-one else is identified as a possible anomaly.

The same algorithm is easily applied in two dimensions. Here we use the oldfaithful data set, and consider the distribution of Duration and Waiting time, ignoring those observations that are greater than 2 hours in either dimension.

of <- oldfaithful |>select(duration, waiting) |>filter(duration <7200, waiting <7200)of |>mutate(loo_scores =density_scores(of, loo =TRUE),lookout =lookout(of) ) |>filter(lookout <0.05) |>arrange(lookout, duration)

Now, six anomalies are identified, with two of them having infinite LOO density scores. We can visualize them in an HDR scatterplot, shown in Figure 6.7.

Code

of |>gg_hdrboxplot(duration, waiting, scatterplot =TRUE, show_lookout =TRUE)

Next we consider some artificial examples. First, we consider the first 48 rows of the second variable in the n01 data, along with the values 4.0 and 4.5.

The algorithm has found a small number of spurious anomalies in each case, out of the 1000 observations included. Notably, the results do not appear to deteriorate with the heavier-tailed or skewed distributions.

6.5 Other density-based methods

Faraway, J J. 2014. Linear Models with r. 2nd ed. Chapman; Hall/CRC.