7  Distance-based methods

Distance-based anomaly detection methods work with the distances between observations. The pairwise distances are computed, and anomalies are identified as those observations that lie far from most other observations.

Before we discuss some of the methods that fall into this class of algorithms, we will first consider the different ways of measuring distances between observations.

7.1 Multivariate pairwise distances

Suppose our observations are denoted by \bm{y}_1,\dots,\bm{y}_n. In univariate data, the distance between any two observations is simply the absolute difference between the points. But in multivariate data, there are many different ways of measuring distances between observations, even when the data are all numerical.

Euclidean distance

Perhaps the most obvious way to define the distance between any two points \bm{y}_i and \bm{y}_j is to measure the length of the straight line between the points. This is commonly known as Euclidean distance, named after the famous Greek mathematician Euclid, and can be defined as \|\bm{y}_i - \bm{y}_j\|_2 = \sqrt{\sum_{k=1}^d (y_{ik}-y_{jk})^2}. This is also known as the L_2 distance. For d=1 or d=2, this is the physical distance between the points when plotted on a strip plot or a scatterplot (provided there is no jittering used).

The dist() function will return a distance matrix containing all pairwise distances between observations. The argument method specifies the type of distance to compute, with the default being Euclidean distance.

Here is an example using only the first five observations of the oldfaithful data set (omitting the time stamp). Because the distances are symmetric, only the lower triangle of the matrix is computed.

head(oldfaithful, 5)
#> # A tibble: 5 × 3
#>   time                duration waiting
#>   <dttm>                 <dbl>   <dbl>
#> 1 2015-01-02 14:53:00      271    5040
#> 2 2015-01-09 23:55:00      247    6060
#> 3 2015-02-07 00:49:00      203    5460
#> 4 2015-02-14 01:09:00      195    5221
#> 5 2015-02-21 01:12:00      210    5401
of <- head(oldfaithful, 5) |> select(-time)
of |> dist()
#>         1       2       3       4
#> 2 1020.28                        
#> 3  425.47  601.61                
#> 4  196.31  840.61  239.13        
#> 5  366.12  660.04   59.41  180.62

Manhattan distance

The Manhattan distance (or absolute distance) between two observations is given by \|\bm{y}_i - \bm{y}_j\|_1 = \sum_{k=1}^d |y_{ik}-y_{jk}|. This is also known as the L_1 distance. It is called the Manhattan distance as it gives the shortest path between the corners of city blocks (denoted as points on a grid) when those blocks are rectangular, as they mostly are in Manhattan. For the same reason, it is also sometimes called the “taxicab” distance or the “city block” distance.

Let’s compute the Manhattan distances for the same five observations from the oldfaithful data set.

of |> dist(method = "manhattan")
#>      1    2    3    4
#> 2 1044               
#> 3  488  644          
#> 4  257  891  247     
#> 5  422  696   66  195

Minkowski distance

This generalizes the Manhattan and Euclidean distances to use powers of p to define the L_p distance: \|\bm{y}_i - \bm{y}_j\|_p = \left(\sum_{k=1}^d |y_{ik}-y_{jk}|^p\right)^{1/p}. It is named after the German mathematician, Hermann Minkowski. When p=1 this is the Manhattan distance, and when p=2 it is the Euclidean distance.

Here is the Minkowski distance with p = 3 for the same five observations from the oldfaithful data set.

of |> dist(method = "minkowski", p = 3)
#>         1       2       3       4
#> 2 1020.00                        
#> 3  420.59  600.08                
#> 4  185.36  839.07  239.00        
#> 5  361.58  659.04   59.03  180.03

Chebyshev distance

The maximum distance between any components of \bm{y}_i and \bm{y}_j is known as the Chebyshev distance (after the Russian mathematician, Pafnuty Chebyshev), given by \|\bm{y}_i - \bm{y}_j\|_{\text{max}} = \max_k |y_{ik}-y_{jk}|.

of |> dist(method = "maximum")
#>      1    2    3    4
#> 2 1020               
#> 3  420  600          
#> 4  181  839  239     
#> 5  361  659   59  180

This is also known as the chessboard distance, as it is the number of moves a king must make to travel between two squares on a chessboard. The Chebyshev distance is equivalent to the Minkowski distance with p=\infty.

Notice that all four of the distances introduced so far are equal to the absolute difference between observations when d=1. The next distance does not have this property.

Mahalanobis distance

When the variables are on different scales, those with the largest ranges will dominate the distance measures. In the Old Faithful example, the waiting times take much larger values than the durations, and so the waiting variable dominates the calculation of distances. Consequently, it is often preferable to scale the data before computing distances.

Suppose we scale the data using the robust multivariate approach described in Section 3.9, so that the scaled data are given by \bm{z}_i = \bm{U} (\bm{y}_i - \bm{m}), where \bm{m} is the pointwise median, and \bm{U}'\bm{U} = \bm{S}_{\text{OGK}}^{-1} is the Cholesky decomposition of the inverse of \bm{S}_{\text{OGK}}, a robust estimate of the covariance matrix of the data. Then the Euclidean distance between \bm{z}_i and \bm{z}_j is given by \| \bm{z}_i - \bm{z}_j\|_2 = \sqrt{(\bm{y}_i - \bm{y}_j)' \bm{U}' \bm{U} (\bm{y}_i - \bm{y}_j)} = \sqrt{(\bm{y}_i - \bm{y}_j)' \bm{S}_{\text{OGK}}^{-1} (\bm{y}_i - \bm{y}_j)}. This is known as the Mahalanobis distance, named after the Indian statistician Prasanta Chandra Mahalanobis (although he wasn’t using a robust measure of covariance). It is a multivariate generalization of the number of standard deviations between any two observations.

Mahalanobis distance is not an option provided by the dist() function, so we will need to compute the scaled data first.

z <- oldfaithful |>
  select(-time) |>
  mvscale()
z |> head(5) |> dist()
#>        1      2      3      4
#> 2 2.8171                     
#> 3 3.7249 2.2624              
#> 4 3.8980 2.7884 0.5801       
#> 5 3.3249 2.0486 0.4014 0.7538
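
To see the connection with the formula above, here is a minimal sketch that computes the same kind of distances by hand, but using the classical (non-robust) column means and covariance matrix in place of \bm{m} and \bm{S}_{\text{OGK}}. The numbers will therefore differ from the robust values above, but the construction via the Cholesky factor \bm{U} is identical.

y <- oldfaithful |>
  select(duration, waiting) |>
  as.matrix()
# Classical centre and covariance (mvscale() uses robust analogues)
m <- colMeans(y)
U <- chol(solve(cov(y)))
# Scaled data: each row is U (y_i - m)
zc <- sweep(y, 2, m) %*% t(U)
# Euclidean distances between the scaled observations are Mahalanobis distances
zc |> head(5) |> dist()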

For more information about pairwise distances, see Borg and Groenen (2005).

7.2 Nearest neighbours

If there are n observations, then there are n(n-1)/2 pairwise distances to compute, so this is an O(n^2) operation which can take a long time for large n.

Some algorithms only require the distances to each observation's k nearest neighbours, although finding those neighbours requires some additional distances to be computed. For some types of distances, efficient solutions are available using kd trees (Bentley 1975; Arya et al. 1998) that find the k nearest neighbours of each observation in O(n\log(n)) time.

The calculation of k nearest neighbours is useful for more than anomaly detection problems. It is also the basis of a popular classification method due to the Berkeley statisticians Evelyn Fix and Joe Hodges (Fix and Hodges 1989) which is often known as the “kNN algorithm”.

Suppose we use the Old Faithful data to find eruptions that are neighbours in the (duration, waiting) space. The dbscan package uses kd trees to quickly identify the k nearest observations to each eruption. We will use the scaled data computed in the last section to find the 5 nearest neighbours to each eruption.

# Find 5 nearest neighbours to each eruption
knn <- dbscan::kNN(z, k = 5)
# First eruption in the data set
oldfaithful[1, ]
#> # A tibble: 1 × 3
#>   time                duration waiting
#>   <dttm>                 <dbl>   <dbl>
#> 1 2015-01-02 14:53:00      271    5040
# Five closest observations
oldfaithful[knn$id[1, ], ]
#> # A tibble: 5 × 3
#>   time                duration waiting
#>   <dttm>                 <dbl>   <dbl>
#> 1 2021-04-13 21:48:00      270    5040
#> 2 2018-09-15 19:32:00      268    5040
#> 3 2018-08-08 00:47:00      272    5160
#> 4 2016-01-28 20:36:00      270    5220
#> 5 2018-09-09 20:33:00      266    5280

For large data sets, approximations are available which speed up the computation even more, but are less accurate in finding the k nearest neighbours. The approx argument specifies a distance tolerance which makes the process faster for large data sets, although the neighbours returned may not be the exact nearest neighbours.

# Find 5 approximate nearest neighbours to each eruption
kann <- dbscan::kNN(z, k = 5, approx = 2)
# Five closest observations
oldfaithful[kann$id[1, ], ]
#> # A tibble: 5 × 3
#>   time                duration waiting
#>   <dttm>                 <dbl>   <dbl>
#> 1 2021-04-13 21:48:00      270    5040
#> 2 2018-09-15 19:32:00      268    5040
#> 3 2018-08-08 00:47:00      272    5160
#> 4 2016-01-28 20:36:00      270    5220
#> 5 2018-09-09 20:33:00      266    5280

In this case, the same five observations have been found.

7.3 Local outlier factors

A popular way of using k-nearest neighbours for anomaly detection is via local outlier factors (Breunig et al. 2000). This is similar to the idea discussed in Section 6.4 of finding points with low probability density estimates, in that it is designed to find points in areas with few surrounding observations.

Suppose we write the distance between observations \bm{y}_i and \bm{y}_j as \|\bm{y}_i - \bm{y}_j\|, and let d_{i,k} be the distance between observation \bm{y}_i and its kth nearest neighbour.

Let N_{i,k} be the set of k nearest neighbours within d_{i,k} of \bm{y}_i. (If there are multiple observations all exactly d_{i,k} from \bm{y}_i, then N_{i,k} may contain more than k observations, but we will ignore that issue here.)

Note that an observation \bm{y}_j may be within N_{i,k} while \bm{y}_i does not fall within N_{j,k}, as shown in the diagram below.

Figure 7.1: A synthetic data set of 11 observations. Nearest neighbourhoods containing five observations each are shown for \bm{y}_1 (in orange) and \bm{y}_2 (in blue). The neighbourhood of \bm{y}_1 contains \bm{y}_2 in its nearest five observations, but \bm{y}_2 does not include \bm{y}_1 in its nearest five observations.

Let r_k(i,j) = \max(d_{j,k}, \|\bm{y}_i-\bm{y}_j\|) be the “reachability” of \bm{y}_i from \bm{y}_j. Thus, r_k(i,j) is the distance between observations \bm{y}_i and \bm{y}_j when they are far from each other, but is equal to d_{j,k} if \bm{y}_i is one of the k nearest neighbours of \bm{y}_j. The reachability is therefore a floored version of the distance \|\bm{y}_i-\bm{y}_j\|, constrained to be no smaller than d_{j,k}.

The average reachability of \bm{y}_i from its nearest neighbours is given by \bar{r}_{i,k} = k^{-1} \sum_{\bm{y}_j \in N_{i,k}} r_k(i,j). This is not the same as the average reachability of the neighbours from \bm{y}_i which, by definition, would be d_{i,k}. An observation will have high average reachability if it is far from its neighbours, whereas it will have low average reachability if it has many close neighbours.

Then the local outlier factor for observation \bm{y}_i is given by \ell_{i,k} = \frac{\bar{r}_{i,k}}{k}\sum_{\bm{y}_j \in N_{i,k}} \bar{r}^{-1}_{j,k}. Thus, it is the product of \bar{r}_{i,k} and the average value of \bar{r}^{-1}_{j,k} for points within its neighbourhood; equivalently, it is the ratio of \bar{r}_{i,k} to the harmonic mean of the \bar{r}_{j,k} values.

This provides a relative measure of the density of each observation compared to its neighbours. An observation is regarded as an anomaly if it is in a low-density region (far from its neighbours) while its neighbours are in higher density regions (with many neighbours nearby). A value of \ell_{i,k} much greater than 1 shows that the observation has much larger average reachability compared to its neighbours, and so is regarded as an anomaly.

One problem with this definition is that \ell_{i,k} can be infinite. If the data set contains at least k+1 identical observations, then the average reachability \bar{r}_{j,k} will be 0 whenever \bm{y}_j is one of the group of identical observations. Consequently, \ell_{i,k} = \infty if any of these identical observations are among the neighbours of \bm{y}_i. In this book we use a slightly different definition from that used elsewhere, and replace these infinite values with zeros.

Another problem is that it can be difficult to determine an appropriate value for k.
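
Local outlier factors can also be computed with the dbscan package. The sketch below applies it to the bivariate scaled data z from Section 7.1, assuming dbscan::lof() with a minPts argument playing the role of k here; the example that follows instead uses a score based only on the durations.

# LOF values for the scaled (duration, waiting) data; minPts plays the role of k
lof_z <- dbscan::lof(z, minPts = 150)
# Attach the scores and list the most anomalous eruptions
oldfaithful |>
  mutate(lof2 = lof_z) |>
  arrange(desc(lof2)) |>
  head(3)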

Example: Old Faithful data

of_scores <- oldfaithful |>
  mutate(lof = lof_scores(duration, k = 150))
of_scores |> arrange(desc(lof))
#> # A tibble: 2,261 × 4
#>    time                duration waiting    lof
#>    <dttm>                 <dbl>   <dbl>  <dbl>
#>  1 2015-12-07 00:09:00     7200    3420 752.  
#>  2 2018-04-25 19:08:00        1    5700   8.65
#>  3 2020-10-25 16:31:00      305    6060   3.26
#>  4 2020-05-07 22:55:00      304    6180   3.17
#>  5 2020-05-21 00:27:00      304    5640   3.17
#>  6 2019-03-14 19:28:00      302    5820   2.99
#>  7 2018-09-14 20:35:00      301    5640   2.90
#>  8 2019-07-25 06:32:00      300    5280   2.81
#>  9 2019-09-17 17:22:00      300    6060   2.81
#> 10 2020-05-07 20:02:00      299    5940   2.72
#> # ℹ 2,251 more rows
of_scores |>
  filter(duration < 7000) |>
  ggplot(aes(x = duration, y = lof)) +
  geom_point()
Figure 7.2: Local outlier factor scores plotted against duration for the Old Faithful data, omitting the extreme 7200-second duration.

Again, the two largest scores correspond to the extreme 2-hour duration and the tiny 1-second duration. The value of k=150 has been chosen after experimenting with various values.

7.4 Stahel-Donoho outlyingness

The Stahel-Donoho outlyingness of an observation is a robust measure of how far it lies from the centre of the data (Stahel 1981; Donoho 1982). The data are projected onto many one-dimensional directions; in each direction, the distance of the projected observation from the projected median is standardized by a robust scale estimate such as the MAD, and the outlyingness is the maximum of these standardized distances over all directions considered. A skewness-adjusted version is described by Brys, Hubert, and Rousseeuw (2005), and is implemented in robustbase::adjOutlyingness().
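
Here is a sketch of how this might be applied to the Old Faithful data. The adjout component assumed below holds the outlyingness values (check the robustbase documentation), and the projection directions are chosen at random, so results will vary between runs.

y <- oldfaithful |> select(duration, waiting)
set.seed(1)  # the projection directions are chosen at random
ao <- robustbase::adjOutlyingness(y)
# Larger outlyingness values indicate more anomalous observations
oldfaithful |>
  mutate(outlyingness = ao$adjout) |>
  arrange(desc(outlyingness)) |>
  head(3)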

7.5 CovMCD

The minimum covariance determinant (MCD) estimator of Rousseeuw and Van Driessen (1999) finds the subset of observations (of a given size) whose sample covariance matrix has the smallest determinant. The mean and (rescaled) covariance matrix of this subset provide robust estimates of location and scatter, and Mahalanobis distances computed from them can be used to flag anomalies. A fast implementation is available in robustbase::covMcd().
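
A minimal sketch, assuming the covMcd() defaults; the robust centre and covariance it returns are combined with stats::mahalanobis() to score each observation.

y <- oldfaithful |> select(duration, waiting)
fit <- robustbase::covMcd(y)
# Squared Mahalanobis distances from the MCD centre, using the MCD covariance
d2 <- mahalanobis(y, center = fit$center, cov = fit$cov)
oldfaithful |>
  mutate(mcd_distance = sqrt(d2)) |>
  arrange(desc(mcd_distance)) |>
  head(3)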

7.6 Detect deviating cells algorithm

The detect deviating cells (DDC) algorithm of Rousseeuw and Van den Bossche (2016) flags anomalous cells (individual values) within a data matrix, rather than whole observations, by comparing each cell with a prediction based on the correlated columns. An implementation is available in the cellWise package.
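
A possible sketch using the cellWise package is shown below. The DDC() call and its indcells component (indices of the flagged cells) are assumptions based on my reading of that package's documentation, so verify them before use.

# DDC works on a numeric data matrix; here the two Old Faithful measurements
y <- oldfaithful |> select(duration, waiting) |> as.matrix()
ddc <- cellWise::DDC(y)
# Indices of the cells flagged as deviating (assumed component name)
head(ddc$indcells)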

7.7 Stray and HDoutliers algorithms

of_scores <- of_scores |>
  mutate(stray = stray_scores(duration))
of_scores |> arrange(desc(stray))
#> # A tibble: 2,261 × 5
#>    time                duration waiting    lof    stray
#>    <dttm>                 <dbl>   <dbl>  <dbl>    <dbl>
#>  1 2015-12-07 00:09:00     7200    3420 752.   0.958   
#>  2 2018-04-25 19:08:00        1    5700   8.65 0.0124  
#>  3 2017-05-03 06:19:00       90    4740   1.79 0.00139 
#>  4 2015-11-21 20:27:00      150    3420   1.78 0.00125 
#>  5 2016-03-22 14:06:00      150    4860   1.78 0.00125 
#>  6 2016-07-13 21:55:00      150    4980   1.78 0.00125 
#>  7 2020-06-18 02:00:00       92    3600   1.67 0.00111 
#>  8 2021-08-14 21:38:06       92    3882   1.67 0.00111 
#>  9 2020-09-25 20:00:00       93    3540   1.61 0.000972
#> 10 2021-07-08 00:44:12      173   47628   1.63 0.000972
#> # ℹ 2,251 more rows
of_scores |>
  filter(duration < 7000) |>
  ggplot(aes(x = duration, y = stray)) +
  geom_point()
Figure 7.3: Stray scores plotted against duration for the Old Faithful data, omitting the extreme 7200-second duration.

See Talagala, Hyndman, and Smith-Miles (2019) for the stray algorithm, and Wilkinson (2017) for the HDoutliers algorithm on which it builds.

7.8 BACON algorithm

BACON stands for “blocked adaptive computationally efficient outlier nominators” (Billor, Hadi, and Velleman 2000). The algorithm starts from a small subset of observations assumed to be free of anomalies, and repeatedly enlarges it by adding all observations whose Mahalanobis distances, computed from the current subset, fall below a threshold. Observations that are never absorbed into the final subset are nominated as outliers. An implementation is available in robustX::mvBACON().
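
A sketch along the following lines should work, assuming robustX::mvBACON() returns a logical subset component marking the final non-outlying observations; check the robustX documentation for the exact component names.

y <- oldfaithful |> select(duration, waiting) |> as.matrix()
bacon <- robustX::mvBACON(y)
# Observations never absorbed into the basic subset are nominated as outliers
oldfaithful |>
  filter(!bacon$subset) |>
  head(3)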

7.9 Isolation forest

An isolation tree is grown by recursively partitioning the data at random until every point is isolated, or some maximum depth is reached. Many isolation trees are combined to form an isolation forest, and each observation receives an anomaly score that decreases with its average isolation depth across the trees: anomalies tend to be isolated after only a few random splits (Liu, Ting, and Zhou 2008). An implementation is available in the isotree package.
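
A minimal sketch using the isotree package is given below; the isolation.forest() function, its ntrees argument, and a predict() method returning scores in [0, 1] (larger values being more anomalous) are assumptions to check against the package documentation.

y <- oldfaithful |> select(duration, waiting)
# Fit an isolation forest with 100 trees
iso <- isotree::isolation.forest(y, ntrees = 100)
# Higher scores correspond to observations isolated after fewer random splits
oldfaithful |>
  mutate(iso_score = predict(iso, y)) |>
  arrange(desc(iso_score)) |>
  head(3)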

7.10 Clustering methods

The dbscan package implements the DBSCAN clustering algorithm, in which observations that are not assigned to any cluster are labelled as noise points and can be treated as candidate anomalies. DBSCAN is described as “density-based”, but “density” here does not mean probability density, as in Chapter 6. Instead, it refers to the relative concentration of points in a neighbourhood.
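
As a sketch, DBSCAN can be applied to the scaled data z from Section 7.1. The eps and minPts values below are illustrative assumptions rather than tuned choices; observations assigned to cluster 0 are the noise points.

# Cluster the scaled Old Faithful data; cluster 0 denotes noise points
cl <- dbscan::dbscan(z, eps = 0.5, minPts = 5)
# Treat the noise points as candidate anomalies
oldfaithful |>
  filter(cl$cluster == 0) |>
  head(3)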