6  Density-based methods

Anomalies are observations that are unlikely to have come from the same distribution as the rest of the data. So one approach to identifying anomalies is to first estimate the probability distribution of the data, and then identify observations that are unlikely to have come from that distribution. This is the approach taken by density-based methods, where anomalies are defined as observations of low probability.

6.1 Density scores and the lookout algorithm

Any density-based method of anomaly detection first requires that we have a density estimate at each observation, which we will denote by f(\bm{y}_i). This may be an assumed density, or estimated from the data; it may be a parametric function, or a nonparametric estimate; it may be a conditional density or a marginal density. Wherever it comes from, we will assume that f represents a probability density function that is appropriate for the data set. When \bm{y}_i takes values on a discrete space (such as the integers), then f is a probability mass function, but we will call it a “density” for convenience when discussing methods that can apply to both discrete and continuous data.

The density score of an observation \bm{y}_i is defined as s_i = -\log f(\bm{y}_i). It is a measure of how anomalous that observation is, given the density f. A large value of s_i indicates that \bm{y}_i is an unlikely value, and so is a potential anomaly; typical values will have low density scores.
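
As a minimal illustration (not one of the examples used later), suppose f is the standard normal density; then the density scores are just negative log densities:

# Density scores under an assumed N(0,1) density: the typical values score
# about 1, while the unlikely value 4.2 scores about 9.7
y <- c(-0.5, 0.1, 4.2)
-dnorm(y, log = TRUE)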

Density scores are commonly used in forecasting and prediction problems, where they are used to assess whether an estimated distribution f provides a good description of the future values y (Gneiting and Katzfuss 2014). In that context, the data y are assumed to come from some distribution, and the density scores are used to assess how well the estimated distribution f matches that distribution. Here we are using density scores in reverse — we assume f is a good description of the data, and then we use the density scores to identify observations that are unlikely to have come from that distribution.

If the density f has been estimated from the data, then it is sometimes useful to consider the leave-one-out (LOO) estimate, denoted by f_{-i}. That is, f_{-i} is the density estimate using all observations other than the ith observation. An unusual observation then cannot influence its own density estimate, giving us a better measure of how anomalous it is. We will call the associated density score a “LOO density score”, given by s_i = -\log f_{-i}(\bm{y}_i).

The “lookout” algorithm (standing for Leave-One-Out Kernel density estimates for OUTlier detection) was proposed by Kandanaarachchi and Hyndman (2022) and uses density scores to find the probability of each observation being an anomaly. Although it was proposed using density scores from kernel density estimates, it can be applied to scores obtained using any density.

The underlying idea of the lookout algorithm is to use extreme value theory, applied to the density scores, to estimate the probability of each observation being an anomaly. We fit a Generalized Pareto Distribution to the density scores using the POT approach discussed in Section 2.8, with the 95^{\text{th}} percentile as the threshold. Then we apply the fitted distribution to the LOO density scores to obtain the probability of each observation being an anomaly. In some cases, leave-one-out scores are too difficult to compute, and then the fitted distribution is applied to the ordinary density scores instead. While this is not quite as accurate, it can still give a useful approximation, especially when there is a large number of observations.
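
The computation is illustrated step by step for a regression example later in this chapter; as a rough sketch (the function name and arguments here are our own, not from any package), the idea is:

# A minimal sketch of the lookout calculation, assuming `scores` and
# `loo_scores` are density scores and LOO density scores for the same data
lookout_probs <- function(scores, loo_scores, p = 0.95) {
  threshold <- quantile(scores, prob = p, type = 8)          # POT threshold
  gpd <- evd::fpot(scores, threshold = threshold)$estimate   # fit GPD to the tail
  evd::pgpd(loo_scores, loc = threshold, scale = gpd["scale"],
            shape = gpd["shape"], lower.tail = FALSE)        # anomaly probabilities
}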

In the following sections, we will discuss how to compute density scores for a variety of different density estimates, and then we will apply the lookout algorithm to those scores.

6.2 Linear regression

Suppose we want to find anomalies amongst n univariate observations y_1,\dots,y_n, and we have p variables that we think might be useful for predicting y. Then we can write the conditional density as f(y \mid \bm{x}), where \bm{x} is a p-dimensional vector of predictor variables. Anomalies in y are identified as observations that are unlikely to have come from the conditional density f. This is commonly called a “regression model”, regardless of the form of f, or whether the relationship with \bm{x} is linear or not.

By far the most common type of regression model assumes that f is a Normal distribution, and that the conditional mean is a linear function of \bm{x}. Note that this does not mean that y is Normally distributed, or that \bm{x} has a Normal distribution. The assumption is that the conditional distribution of y given \bm{x} is Normal, which can easily be checked by looking at the residuals from the regression model.
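
For example, one simple check is a normal Q-Q plot of the standardized residuals; a minimal sketch, assuming fit is the output of stats::lm() as used later in this section:

# An approximately straight Q-Q plot supports the assumption that the
# conditional distribution of y given x is Normal
qqnorm(rstandard(fit))
qqline(rstandard(fit))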

For a linear Normal regression model, with independent observations and homoscedastic errors, the conditional distribution is given by y \mid \bm{x} \sim N(\bm{x}_+'\bm{\beta}, \sigma^2), \tag{6.1} where \bm{x}_+ = [1, \bm{x}]' is a (p+1)-dimensional vector containing a 1 in the first position and the predictors in the remaining positions, and \bm{\beta} is a (p+1)-dimensional vector of regression coefficients.

Model estimation

The model can be written in matrix form as \bm{y} \sim N(\bm{X}\bm{\beta}, \sigma^2\bm{I}), where \bm{X} is an n\times(p+1) matrix with the first column being a vector of 1s, and the other columns containing the predictor variables, or equivalently as \bm{\varepsilon} = \bm{y} - \bm{X}\bm{\beta} \sim N(\bm{0}, \sigma^2\bm{I}). \tag{6.2} Provided \bm{X} is of rank p+1, and the errors \bm{\varepsilon} are independent of \bm{X}, the model can be estimated using ordinary least squares regression (Seber and Lee 2003), resulting in the estimate \hat{\bm{\beta}} = (\bm{X}'\bm{X})^{-1}\bm{X}'\bm{y}. The fitted values (i.e., predicted values for the training data) are given by \hat{\bm{y}} = \bm{X}\hat{\bm{\beta}} = \bm{H}\bm{y}, where \bm{H} = \bm{X}(\bm{X}'\bm{X})^{-1}\bm{X}' is known as the “hat”-matrix because it creates the “y-hat” values \hat{\bm{y}} from the data \bm{y}.

The diagonals of \bm{H}, given by h_1,\dots,h_n, take values between 0 and 1. These are known as the “leverage” values (Faraway 2014, p69), and measure how much each observation influences the corresponding fitted value. High leverage values (close to 1) correspond to observations that have a large influence on the estimated coefficients, and so leaving those observations out will lead to very different values for the fitted values and residuals. On the other hand, small leverage values (close to 0) correspond to observations that have little influence on the estimated coefficients, and so leaving those observations out will lead to similar values for the fitted values and residuals.
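
These quantities are easy to verify numerically; a minimal sketch on simulated data (not one of the book's examples):

# The hat matrix reproduces the fitted values, and its diagonal gives the
# leverage values returned by hatvalues()
set.seed(123)
df_sim <- data.frame(x = rnorm(20))
df_sim$y <- 1 + 2 * df_sim$x + rnorm(20)
fit_sim <- lm(y ~ x, data = df_sim)
X <- model.matrix(fit_sim)
H <- X %*% solve(t(X) %*% X) %*% t(X)
all.equal(as.vector(H %*% df_sim$y), unname(fitted(fit_sim)))  # y-hat = H y
all.equal(diag(H), unname(hatvalues(fit_sim)))                 # leverage values h_i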

Residuals

The residuals from the model are given by \bm{e} = \bm{y} - \hat{\bm{y}} = (\bm{I} - \bm{H})\bm{y}. \tag{6.3} Note that the residuals have the distribution \bm{e}\mid\bm{X} \sim N(\bm{0}, \sigma^2(\bm{I} - \bm{H})), which is not quite the same as the distribution of the errors given by Equation 6.2. However, the two distributions are asymptotically equivalent as n\rightarrow\infty. Often, we need standardized residuals, which are obtained by dividing each residual by its estimated standard deviation, giving r_i = \frac{e_i}{\hat{\sigma}\sqrt{1-h_i}}, \qquad i = 1,\dots, n, where \hat\sigma^2 = \frac{1}{n-p-1}\sum_{i=1}^n e_i^2 \tag{6.4} is the estimated residual variance.

A linear model can be estimated in R using the stats::lm() function. The broom::augment() function will compute the residuals (named .resid), the standardized residuals (named .std.resid), and the leverage values (named .hat).

The density scores under the Gaussian linear regression model (Equation 6.1) can be estimated using these standardized residuals, giving s_i = -\log\phi(r_i), \tag{6.5} where \phi(u) = (2\pi)^{-1/2}e^{-u^2/2} is the standard normal density. This can be computed as follows, assuming that fit is the output from stats::lm().

broom::augment(fit) |>
  mutate(f_scores = -dnorm(.std.resid, log = TRUE))

Equivalently, the density_scores() function will compute them:

density_scores(fit)

LOO residuals

The leave-one-out residual for the ith observation is defined as the difference between y_i and the predicted value obtained from a model fitted to all observations except the ith. At first, this appears to involve a lot of computation, requiring n separate model fits. However, the leave-one-out residuals are easily obtained from a linear regression model without actually having to re-estimate the model many times. It can be shown (Montgomery, Peck, and Vining 2012, Appendix C.7) that the leave-one-out (LOO) residuals are given by e_{-i} = e_{i}/(1-h_{i}), \tag{6.6} where e_{i} is the residual obtained from fitting the model to all observations. If we standardize each LOO residual by its estimated standard deviation \hat\sigma/\sqrt{1-h_i}, with \hat\sigma from Equation 6.4, we recover the “standardized” residuals r_i.
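
As a quick check of Equation 6.6, we can reuse the simulated fit_sim from the earlier sketch: the formula matches the residual obtained by actually refitting the model without the observation.

# LOO residuals via Equation 6.6, compared against a brute-force refit
loo <- residuals(fit_sim) / (1 - hatvalues(fit_sim))
fit1 <- lm(y ~ x, data = df_sim[-1, ])          # drop observation 1 and refit
all.equal(unname(loo[1]),
          df_sim$y[1] - unname(predict(fit1, newdata = df_sim[1, ])))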

In the context of anomaly detection, it often makes more sense to standardize each LOO residual by the standard deviation estimated from the model fitted to all other observations, rather than by \hat\sigma. If we leave out the ith observation, and fit a regression model to the remaining observations, then the estimated variance of the residuals is given by (Montgomery, Peck, and Vining 2012, Appendix C.8) \hat\sigma_{-i}^2 = \frac{1}{n-p-2}\left[(n-p-1)\hat\sigma^2 - e_{i}^2/(1-h_i)\right]. \tag{6.7} These are computed by broom::augment() and the values of \hat\sigma_{-i} are returned in the column .sigma.

If we standardize each residual using \hat\sigma_{-i}, we obtain the “studentized” residuals r_{-i} = \frac{e_{i}}{\hat\sigma_{-i}\sqrt{1-h_i}}, \tag{6.8} from which we obtain the LOO density scores for the regression model, given by s_{-i} = -\log \phi(r_{-i}). \tag{6.9}

If fit is the output from stats::lm(), then these quantities can be computed as follows.

broom::augment(fit) |>
  mutate(
    studentized_residuals = .resid / (.sigma * sqrt(1 - .hat)),
    loo_fscores = -dnorm(studentized_residuals, log = TRUE)
  )

More simply, we can just use the density_scores() function again:

density_scores(fit, type = "loo")

Example: Shiraz reviews

For example, consider the wine reviews of Shiraz (aka Syrah), plotted in Figure 1.6. We can fit a linear regression model to these data to obtain a conditional density estimate of price given the points awarded to each wine. Then, \bm{X} contains just two columns: a column of 1s, and a column containing the points values. The vector \bm{y} contains the log prices of the wines. The model can be fitted as follows.

shiraz <- wine_reviews |> filter(variety %in% c("Shiraz", "Syrah"))
fit_wine <- lm(log(price) ~ points, data = shiraz)
summary(fit_wine)
#> 
#> Call:
#> lm(formula = log(price) ~ points, data = shiraz)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.6407 -0.3361 -0.0109  0.3052  2.9909 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -6.05913    0.20731   -29.2   <2e-16 ***
#> points       0.10690    0.00232    46.0   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.485 on 4494 degrees of freedom
#> Multiple R-squared:  0.32,   Adjusted R-squared:  0.32 
#> F-statistic: 2.12e+03 on 1 and 4494 DF,  p-value: <2e-16

The fitted model can be written as \log(\text{Price}) \sim N(-6.059 + 0.107 \times \text{Points}, 0.485^2), and is depicted in Figure 6.1 with 95% prediction intervals.

Code
wine_aug <- broom::augment(fit_wine, data = shiraz, interval = "prediction") |>
  mutate(
    lwr = exp(.lower),
    upr = exp(.upper),
    location = case_when(
      price < lwr ~ "below",
      price > upr ~ "above",
      TRUE ~ "within"
    )
  )
wine_aug |>
  ggplot(aes(y = price, x = points, col = location)) +
  geom_jitter(height = 0, width = 0.1, alpha = 0.5) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "#cccccc", alpha = 0.25) +
  geom_line(aes(y = exp(.fitted)), color = "#666666") +
  scale_y_log10() +
  guides(fill = "none", col = "none") +
  scale_color_manual(values = c("#0072B2", "#D55E00", "#AAAAAA"))
Figure 6.1: Log price of Shiraz as a function of points, with 95% prediction intervals. The points are horizontally jittered to reduce overplotting. Points outside the prediction intervals are colored.

The LOO density scores obtained from this model are shown in Figure 6.2, using the same colors as Figure 6.1 to indicate whether the observation is below, within, or above, the 95% prediction interval.

Code
wine_aug <- wine_aug |>
  mutate(
    Studentized_residuals = .resid / (.sigma * sqrt(1 - .hat)),
    fscores = -dnorm(.std.resid, log = TRUE),
    LOO_fscores = -dnorm(Studentized_residuals, log = TRUE)
  )
wine_aug |>
  select(points, Studentized_residuals, LOO_fscores, location) |>
  pivot_longer(c(Studentized_residuals, LOO_fscores), names_to = "variable", values_to = "value") |>
  mutate(variable = factor(variable, levels = c("Studentized_residuals", "LOO_fscores"))) |>
  ggplot(aes(x = points, y = value, col = location)) +
  facet_grid(variable ~ ., scales = "free_y") +
  geom_jitter(height = 0, width = 0.1, alpha = 0.5) +
  geom_hline(yintercept = 0, color = "#666666") +
  labs(x = "Points", y = "") +
  guides(fill = "none", col = "none") +
  scale_color_manual(values = c("#0072B2", "#D55E00", "#AAAAAA"))
Figure 6.2: Residuals and density scores for the Shiraz data using a linear regression model. Points are colored to match the 95% prediction intervals in Figure 6.1.

The over-priced wines under this model are shown in blue, while the under-priced wines are shown in orange. This shows that the most anomalous observations are the two with LOO density scores above 17, and studentized residuals close to 6. The largest LOO density score is for the most over-priced wine (under this model), a 2009 Shiraz from the Henschke winery in the Eden Valley region of South Australia, with 91 points and a price of $780.

wine_aug |>
  filter(LOO_fscores == max(LOO_fscores)) |>
  select(country:winery, year, points, price, Studentized_residuals, LOO_fscores)
#> # A tibble: 1 × 9
#>   country   state       region winery  year points price Studentized_residuals
#>   <chr>     <chr>       <chr>  <chr>  <dbl>  <dbl> <dbl>                 <dbl>
#> 1 Australia South Aust… Eden … Hensc…  2009     91   780                  6.20
#> # ℹ 1 more variable: LOO_fscores <dbl>

The largest LOO density score corresponding to an under-priced wine is for the wine with the lowest residual value, with 85 points and a price of $4. Another good buy, at the higher quality end, is the 2007 Syrah from the Rulo winery in the Columbia Valley in Washington State, USA:

wine_aug |>
  filter(points > 95) |>
  filter(Studentized_residuals == min(Studentized_residuals)) |>
  select(country:winery, year, points:price, Studentized_residuals, LOO_fscores)
#> # A tibble: 1 × 9
#>   country state      region    winery  year points price Studentized_residuals
#>   <chr>   <chr>      <chr>     <chr>  <dbl>  <dbl> <dbl>                 <dbl>
#> 1 US      Washington Columbia… Rulo    2007     96    20                 -2.49
#> # ℹ 1 more variable: LOO_fscores <dbl>

This corresponds to the only orange point in Figure 6.1 that has a points value above 95 and a price below the 95% prediction interval.

Lookout probabilities

We will apply the lookout algorithm to these scores. First we compute the threshold value equal to the 95th percentile of the density scores, and then fit a generalized Pareto distribution to the scores above this threshold.

threshold <- quantile(wine_aug$fscores, prob = 0.95, type = 8)
gpd <- evd::fpot(wine_aug$fscores, threshold = threshold)$estimate

Now we use the fitted model to compute the lookout probabilities from the LOO scores for each observation.

wine_aug |>
  mutate(
    lookout_prob = evd::pgpd(LOO_fscores, loc = threshold,
      scale = gpd["scale"], shape = gpd["shape"], lower.tail = FALSE
    )
  ) |>
  select(country:winery, year, points, price, LOO_fscores, lookout_prob) |>
  arrange(lookout_prob)
#> # A tibble: 4,496 × 9
#>    country   state   region winery  year points price LOO_fscores lookout_prob
#>    <chr>     <chr>   <chr>  <chr>  <dbl>  <dbl> <dbl>       <dbl>        <dbl>
#>  1 Australia South … Eden … Hensc…  2009     91   780       20.1       0.00566
#>  2 US        Califo… Centr… Law     2013     92   750       18.3       0.00693
#>  3 Australia South … Eden … Hensc…  2010     96   820       14.4       0.0116 
#>  4 Australia South … South… Penfo…  2008     98   850       12.5       0.0156 
#>  5 Italy     Tuscany Tosca… Tua R…  2011     89   300       11.7       0.0181 
#>  6 Australia South … South… Penfo…  2010     99   850       11.5       0.0189 
#>  7 Italy     Tuscany Tosca… Tua R…  2013     90   300       10.7       0.0221 
#>  8 Italy     Tuscany Tosca… Tua R…  2012     90   300       10.7       0.0221 
#>  9 France    Rhône … Côte … Domai…  2013     93   391       10.2       0.0247 
#> 10 Australia South … South… Penfo…  2004     96   500        9.57      0.0287 
#> # ℹ 4,486 more rows

Those with the ten smallest lookout probabilities are all wines that appear to be over-priced given their points values.

The above code was introduced to illustrate each step of the procedure, but it is simpler to use the lookout() function. This can be applied directly to the fitted model object, as follows.

shiraz <- shiraz |>
  mutate(lookout_prob = lookout(fit_wine))
shiraz |>
  select(country:winery, year, points, price, lookout_prob) |>
  arrange(lookout_prob)
#> # A tibble: 4,496 × 8
#>    country   state           region     winery  year points price lookout_prob
#>    <chr>     <chr>           <chr>      <chr>  <dbl>  <dbl> <dbl>        <dbl>
#>  1 Australia South Australia Eden Vall… Hensc…  2009     91   780      0.00566
#>  2 US        California      Central C… Law     2013     92   750      0.00693
#>  3 Australia South Australia Eden Vall… Hensc…  2010     96   820      0.0116 
#>  4 Australia South Australia South Aus… Penfo…  2008     98   850      0.0156 
#>  5 Italy     Tuscany         Toscana    Tua R…  2011     89   300      0.0181 
#>  6 Australia South Australia South Aus… Penfo…  2010     99   850      0.0189 
#>  7 Italy     Tuscany         Toscana    Tua R…  2013     90   300      0.0221 
#>  8 Italy     Tuscany         Toscana    Tua R…  2012     90   300      0.0221 
#>  9 France    Rhône Valley    Côte Rôtie Domai…  2013     93   391      0.0247 
#> 10 Australia South Australia South Aus… Penfo…  2004     96   500      0.0287 
#> # ℹ 4,486 more rows

Figure 6.3 shows the relationship between points and price, with the points colored according to the lookout probability. The six blue points are observations with lookout probabilities less than 0.02.

Code
shiraz |>
  ggplot(aes(x = points, y = price, color = lookout_prob < 0.02)) +
  geom_jitter(height = 0, width = 0.2) +
  scale_y_log10()
Figure 6.3: Price vs points for Shiraz wines. Points are colored according to the lookout probability. The six blue points are observations with lookout probabilities less than 0.02.

6.3 GAM density scores

In some applications, it is not appropriate to assume the conditional density is Gaussian, or that the relationships are linear. One useful model that allows for non-Gaussian densities, and non-linear relationships, is a generalized additive model or GAM. Under this model, the conditional density is given by (Wood 2017) y\mid\bm{x} \sim f(\mu), \qquad \ell(\mu) = \sum_{k=1}^p g_k(x_{k}), where \mu = \text{E}(y | \bm{x}) denotes the conditional mean, \ell is a link function, and each g_k function is smooth. If f is Normal, \ell is the identity, and g_k(u) = \beta_k u, then this reduces to the linear Gaussian model (Equation 6.1).

Consider the number of “not outs” for each batter in the cricket_batting data set. A “not out” occurs when a batsman has not been dismissed at the end of the team’s innings. Let’s consider whether there are some batters who have an unusually high proportion of not outs. The data set contains results from 91022 innings, of which 11883 were not outs. So the overall proportion of not outs is 11883 / 91022 = 0.131.
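
These totals can be computed directly from the data; a small sketch using the same columns as the code below, which should reproduce the counts quoted above:

cricket_batting |>
  filter(Innings > 0) |>
  summarise(
    innings = sum(Innings),
    not_outs = sum(NotOuts),
    prop_no = not_outs / innings
  )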

Figure 6.4 shows the proportion of not outs for each batter as a function of the number of innings they played. There is some overplotting due to batters having the same numbers of not outs and innings, which results in darker points where observations coincide. The unusual structure on the left of the plot is due to the discrete nature of the data. Batters who have played only a small number of innings tend to have a higher proportion of not outs on average, and a higher variance, than those who have played a large number of innings.

df_no <- cricket_batting |>
  filter(Innings > 0) |>
  mutate(prop_no = NotOuts / Innings)
df_no |>
  ggplot(aes(x = Innings, y = NotOuts/Innings)) +
  geom_point(alpha = 0.15)
Figure 6.4: Proportion of not outs for each batter as a function of the number of innings they played.

This suggests that we can construct a GAM for the number of not outs for each batter as a function of the number of innings they played. It is natural to use a Binomial distribution with a logit link function: \text{NotOuts} \mid \text{Innings} \sim \text{Binomial}(n=\text{Innings},~ p), where \log(p / (1-p)) = g(\text{Innings}). We can fit this model using the mgcv package.

fit_notouts <- mgcv::gam(prop_no ~ s(Innings), data = df_no,
           family = binomial(link = logit), weights = Innings)
notouts_aug <- broom::augment(fit_notouts, data = df_no, type.predict = "response")
notouts_aug |>
  ggplot(aes(x = Innings, y = prop_no)) +
  geom_point(alpha = 0.2) +
  geom_line(aes(y = .fitted), color = "#0072B2") +
  geom_ribbon(aes(ymin = .fitted - 2*.se.fit,
                  ymax = .fitted + 2*.se.fit),
              fill = "#0072B2", alpha = 0.2) +
  labs(y = "Proportion of not outs")
Figure 6.5: Proportion of not outs for each batter as a function of the number of innings they played, with a GAM fit using a Binomial distribution. The blue line shows the probability of a batter being not out as a function of the number of Innings they have played.

Now we can use the fitted model to compute the density scores from the Binomial distribution, and find the most anomalous batters. Unfortunately, there is not a convenient way to compute LOO density scores for GAM models, so we will only consider density scores in this example.

notouts_aug <- notouts_aug |>
  mutate(
    fscores = density_scores(fit_notouts),
    lookout = lookout(fit_notouts)
  ) |>
  select(Player:Country, Innings:NotOuts, prop_no:.fitted, fscores:lookout) |>
  arrange(desc(fscores))
notouts_aug
#> # A tibble: 3,702 × 8
#>    Player           Country    Innings NotOuts prop_no .fitted fscores lookout
#>    <chr>            <chr>        <int>   <int>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1 JM Anderson      England        233      99   0.425  0.120     71.3 0.00869
#>  2 CS Martin        New Zeala…     104      52   0.5    0.121     47.1 0.0185 
#>  3 RGD Willis       England        128      55   0.430  0.121     40.8 0.0240 
#>  4 CA Walsh         West Indi…     185      61   0.330  0.0965    40.7 0.0241 
#>  5 TA Boult         New Zeala…      86      42   0.488  0.121     37.1 0.0286 
#>  6 M Muralitharan   ICC/Sri L…     164      56   0.341  0.102     36.8 0.0290 
#>  7 EJ Chatfield     New Zeala…      54      33   0.611  0.131     36.0 0.0301 
#>  8 BS Chandrasekhar India           80      39   0.488  0.123     34.1 0.0334 
#>  9 GD McGrath       Australia      138      51   0.370  0.118     31.8 0.0379 
#> 10 M Ntini          South Afr…     116      45   0.388  0.122     29.1 0.0447 
#> # ℹ 3,692 more rows

The most anomalous batters are all “tail-enders” (i.e., not skilled batters) who played for a long time (so they have a large number of innings). Because they batted last, or nearly last, they are more likely to be not out at the end of the team’s innings.

The .fitted value is the expected proportion of not outs for each player given the number of innings they have played, while prop_no gives the actual proportion of not outs they have had. The largest density score is for the English batter Jimmy Anderson, who has had 99 not outs in 233 innings, much higher than the expected number of not outs of 233 \times 0.120 = 27.9. This anomaly can also be seen in Figure 6.5, where the point is clearly unusual for that part of the data. Although Jimmy Anderson was not a great batter, he was good at defence, and was able to bat for a long time without being dismissed, leaving the other batter time to score runs.

We have identified an anomaly that is not unusual in its proportion of not outs, nor in its number of innings; even the difference between the actual and predicted proportions is not extreme compared to some other observations. However, because we have used a statistical model, we have been able to account for the particular features of this data set, such as the discrete nature of the data and the changing variance, and so identify an observation that is anomalous in the context of the model.

6.4 KDE scores

Suppose, instead of a regression or a GAM, we estimate f using a kernel density estimate. Then we call the resulting density scores “kde scores”. The kernel density estimate at each observation is (Equation 3.3) f_i = \hat{f}(\bm{y}_i) = \frac{1}{n} \sum_{j=1}^n K_H(\bm{y}_i-\bm{y}_j), \tag{6.10} and so the “kde score” at each observation is s_i = -\log(f_i). The largest possible score occurs when an observation has no other observations nearby. Then f_i \approx K_H(\bm{0})/n because K_H(\bm{y}_i-\bm{y}_j)\approx 0 when \|\bm{y}_i-\bm{y}_j\| is large. So the largest possible kde score, when using a Gaussian kernel, is -\log(K_H(\bm{0})/n) \approx \log(n) + \frac{d}{2}\log(2\pi) + \frac{1}{2}\log\det(\bm{H}), where \bm{H} is now the bandwidth matrix (not the hat matrix of Section 6.2). For univariate data, when d=1, this simplifies to -\log(K_h(0)/n) \approx \log(nh\sqrt{2\pi}).
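
To see this bound in action, here is a small sketch with simulated univariate data and a Gaussian kernel (the bandwidth rule bw.nrd0() from base R is just an arbitrary choice for illustration):

# The kde score of an isolated point is close to the bound log(n*h*sqrt(2*pi))
set.seed(1)
y <- c(rnorm(99), 100)                    # one point far from all the others
n <- length(y)
h <- bw.nrd0(y)
f_iso <- mean(dnorm((100 - y) / h)) / h   # kde evaluated at the isolated point
c(score = -log(f_iso), bound = log(n * h * sqrt(2 * pi)))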

Leave-one-out kde scores

The contribution of the ith point to the kernel density estimate at that point is K_H(\bm{0})/n. Therefore, the leave-one-out density estimates are given by f_{-i} = \left[nf_i - K_H(\bm{0})\right]/(n-1), \tag{6.11} where f_i is the kde estimate at \bm{y}_i using all data, and the LOO kde scores are s_{-i} = -\log(f_{-i}). Thus, we can compute the leave-one-out kernel density scores without needing to re-estimate the density many times.
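
For univariate data with a Gaussian kernel, Equation 6.11 can be coded in a few lines. This is only a sketch to show the idea; the density_scores() function used below does this (and more) for us.

# LOO kde scores from the ordinary kde values, without refitting n densities
loo_kde_scores <- function(y, h = bw.nrd0(y)) {
  n  <- length(y)
  fi <- sapply(y, function(x) mean(dnorm((x - y) / h)) / h)  # kde at each point
  f_loo <- (n * fi - dnorm(0) / h) / (n - 1)                 # Equation 6.11
  -log(pmax(f_loo, 0))   # an isolated point can give 0, hence an infinite score
}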

Example: Old Faithful eruption durations

For the Old Faithful eruption duration data, we obtain the following LOO kde scores.

of_scores <- oldfaithful |>
  mutate(
    score = density_scores(duration, loo = FALSE),
    loo_score = density_scores(duration, loo = TRUE),
    lookout = lookout(duration)
  )
of_scores |> arrange(desc(loo_score))
#> # A tibble: 2,261 × 6
#>    time                duration waiting score loo_score lookout
#>    <dttm>                 <dbl>   <dbl> <dbl>     <dbl>   <dbl>
#>  1 2015-12-07 00:09:00     7200    3420 10.8     Inf     0     
#>  2 2018-04-25 19:08:00        1    5700 10.8     Inf     0     
#>  3 2020-10-25 16:31:00      305    6060  8.20      8.27  0.0740
#>  4 2020-05-07 22:55:00      304    6180  8.10      8.16  0.0862
#>  5 2020-05-21 00:27:00      304    5640  8.10      8.16  0.0862
#>  6 2017-09-01 01:18:00      155    4560  8.02      8.09  0.0963
#>  7 2017-09-07 04:56:00      155    4740  8.02      8.09  0.0963
#>  8 2015-11-21 20:27:00      150    3420  7.97      8.03  0.104 
#>  9 2016-03-22 14:06:00      150    4860  7.97      8.03  0.104 
#> 10 2016-07-13 21:55:00      150    4980  7.97      8.03  0.104 
#> # ℹ 2,251 more rows

The two infinite LOO scores correspond to the extreme 2 hour duration and the tiny 1 second duration. These are so improbable given the rest of the data that the scores are effectively infinite. The regular kde scores (with loo = FALSE) are at their maximum values.

Figure 6.6 shows an HDR boxplot of the data (other than the maximum), with those points identified as lookout anomalies highlighted in red. Notice that we can simply add the argument show_lookout = TRUE in order to highlight points with lookout probabilities less than 0.05.

oldfaithful  |>
  filter(duration < 7200)  |>
  gg_hdrboxplot(duration, show_lookout = TRUE)
Figure 6.6: HDR boxplot of the Old Faithful eruption durations, with the lookout anomalies highlighted in red.

More examples

Let’s apply the lookout algorithm to the six examples introduced in Section 4.1.

cricket_batting |>
  filter(Innings > 20) |>
  mutate(lookout = lookout(Average)) |>
  filter(lookout < 0.05) |>
  select(Player, Average, lookout)
#> # A tibble: 1 × 3
#>   Player     Average  lookout
#>   <chr>        <dbl>    <dbl>
#> 1 DG Bradman    99.9 0.000193

Here Bradman is a clear anomaly (with a very low lookout probability), and no-one else is identified as a possible anomaly.

The same algorithm is easily applied in two dimensions. Here we use the oldfaithful data set, and consider the joint distribution of duration and waiting times, ignoring those observations where either value is greater than 2 hours.

of <- oldfaithful  |>
  select(duration, waiting) |>
  filter(duration < 7200, waiting < 7200)
of |>
  mutate(
    loo_scores = density_scores(of, loo = TRUE),
    lookout = lookout(of)
  ) |>
  filter(lookout < 0.05) |>
  arrange(lookout, duration)
#> # A tibble: 6 × 4
#>   duration waiting loo_scores lookout
#>      <dbl>   <dbl>      <dbl>   <dbl>
#> 1        1    5700      Inf   0      
#> 2      120    6060      Inf   0      
#> 3      170    3600       17.6 0.00302
#> 4      170    3840       17.5 0.00478
#> 5      150    3420       17.4 0.00596
#> 6       90    4740       16.9 0.0178

Now, six anomalies are identified, with two of them having infinite LOO density scores. We can visualize them in an HDR scatterplot, shown in Figure 6.7.

Code
of |>
  gg_hdrboxplot(duration, waiting, scatterplot = TRUE, show_lookout = TRUE)
Figure 6.7: HDR scatterplot of the Old Faithful eruption durations and waiting times, with the lookout anomalies highlighted in red.

Next we consider some artificial examples. First, we consider the first 48 rows of the second variable in the n01 data, along with the values 4.0 and 4.5.

n01b <- tibble(y = c(n01$v2[1:48], 4, 4.5))
n01b |>
  mutate(lookout = lookout(y)) |>
  filter(lookout < 0.05) |>
  arrange(lookout)
#> # A tibble: 2 × 2
#>       y lookout
#>   <dbl>   <dbl>
#> 1   4         0
#> 2   4.5       0

As expected, only the two genuine anomalies have been identified.

Finally, we consider 1000 simulated observations from each of the distributions, N(0,1), \text{t}_3 and \chi^2_4.

n01 |>
  select(v1) |>
  mutate(lookout = lookout(v1)) |>
  filter(lookout < 0.05) |>
  arrange(lookout, v1)
#> # A tibble: 2 × 2
#>      v1 lookout
#>   <dbl>   <dbl>
#> 1  3.81 0.00160
#> 2  3.06 0.0324
set.seed(1)
tibble(y = rt(1000, df = 3)) |>
  mutate(lookout = lookout(y)) |>
  filter(lookout < 0.05) |>
  arrange(lookout, y)
#> # A tibble: 5 × 2
#>        y lookout
#>    <dbl>   <dbl>
#> 1 -11.4  0      
#> 2   7.65 0      
#> 3  10.5  0      
#> 4  -7.40 0.00175
#> 5  -6.61 0.0337
tibble(y = rchisq(1000, df = 4)) |>
  mutate(lookout = lookout(y)) |>
  filter(lookout < 0.05) |>
  arrange(lookout, y)
#> # A tibble: 2 × 2
#>       y lookout
#>   <dbl>   <dbl>
#> 1  17.0 0.00314
#> 2  15.2 0.0350

The algorithm has found a small number of spurious anomalies in each case, out of the 1000 observations included. Notably, the results do not appear to deteriorate with the heavier-tailed or skewed distributions.

6.5 Other density-based methods