Note on performance testing of the core algorithm of Eyer

Abstract

We have been evaluating Eyer's core anomaly detection algorithm by calculating F1 score, recall, and precision using a subset of the Server Machine Dataset, a publicly available labelled dataset. Initially, we tested the dataset as is, then proceeded to relabel it. The relabelling was prompted by the observation that some events, which appeared anomalous based on the data, were not labelled as such. We'll discuss this observation further in the relabelling section. This test was conducted on a smaller dataset, and we plan to release additional updates with larger datasets in the future. Our algorithm consistently shows high recall in both the original and relabelled cases, indicating its effectiveness in identifying all labelled anomalies. For the original dataset, precision and, consequently, the F1 score were low, but improved significantly after relabelling. In the relabelled dataset, we achieved an F1 score of 0.81 (0.86 when considering only critical anomalies) after four weeks of data collection.

Introduction

We are continuously testing our algorithms to enhance their quality. This note focuses on a specific test using a labelled open-source dataset, the Server Machine Dataset (SMD)¹, which is sometimes used in the literature to evaluate AIOps algorithms for monitoring IT systems. The performance metrics that we focus on are the F1 score², recall, and precision.

The F1 score reflects the balance between how well the algorithm identifies all actual anomalies (recall) and how many alerts are genuine (precision). It is calculated as:

F1 = 2*R*P / (R+P)

Where R and P are recall and precision, which are defined as:

R = truePositives / (truePositives + falseNegatives)

P = truePositives / (truePositives + falsePositives)
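As a worked illustration of the three formulas above, the following minimal Python sketch computes recall, precision, and F1 from point counts. The function names and the counts in the example are illustrative assumptions and are not taken from the SMD test.

```python
from typing import Tuple


def recall_precision(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (recall, precision) from true/false positive and false negative counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


def f1_score(recall: float, precision: float) -> float:
    """F1 = 2*R*P / (R+P); defined here as 0.0 when both R and P are zero."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Illustrative counts only (hypothetical, not measured on the SMD).
r, p = recall_precision(tp=90, fp=30, fn=10)
f1 = f1_score(r, p)
print(f"recall={r:.2f} precision={p:.2f} F1={f1:.2f}")
```

Because F1 is a harmonic mean, it stays close to the lower of the two values: a recall near 1 cannot compensate for a low precision.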

The dataset is said to be anomaly-free in the first half (used for training) and labelled for anomalies in the second half. Labelling is often context-specific and can be incomplete when taken out of its original context. Additionally, the claim that the first half of the dataset is anomaly-free can be questioned. Eyer is designed to be as generic as possible rather than hyper-tuned to a specific use case, which means that our evaluation requires labelling all changes in the dataset compared to prior behaviour. However, the labelling in the SMD does not seem to follow this approach. Furthermore, as labelling is a time-consuming task, open datasets are often only partially labelled, and human errors can lead to missed anomalies.

Our approach involves first testing the algorithm using the original labels, then relabelling the dataset and measuring performance again. A section will be dedicated to explaining how the relabelling was conducted. In the future, we will expand this exercise to include more data from the SMD and other datasets, both open-source and internally generated.

Methodology

Before presenting the results we have to introduce how we define true positives, false positives, and false negatives. It might seem trivial to define these, but in reality an anomaly is very often not a single data point but a series of data points. In the case of IT operations, anomalies are often deviations from the usual behaviour that persist for some time. When labelling our data manually we might miss the actual start and end points of such deviations, and sometimes it is impossible to define them precisely. For example, does an anomaly start when a deviation is well established, or should we also include the oscillations that preceded it? Does an anomaly end when the value of a metric is back to normal, or when it is on its way back to normal but has not reached it yet?

In the following, a labelled anomaly is a series of datapoints that are labelled as anomalous; these labels are used to test the algorithm. A detected anomaly is a series of datapoints considered anomalous by our algorithm. To account for the variability of labelling we have established the following rules:

A labelled anomaly that overlaps with a detected anomaly is considered detected. None of the datapoints in the labelled anomaly are counted as false negatives, even if they do not correspond to datapoints in a detected anomaly.
All the datapoints of a detected anomaly are considered true positives if the part of the detected anomaly outside its overlap with labelled anomalies is no longer than 50% of that overlap.
If a detected anomaly extends beyond its overlap with labelled anomalies by more than 50% of that overlap, the additional points are considered false positives.
All the points of detected anomalies that do not overlap with labelled anomalies are considered false positives.
All the points of a labelled anomaly that does not overlap with any detected anomaly, and is therefore not detected, are false negatives.
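The rules above can be sketched in Python as follows. This is one possible reading of the rules, not Eyer's evaluation code. It assumes that the 50% tolerance is measured against the number of overlapping points, that points beyond the tolerance are the only ones counted as false positives in an overlapping detection, and that true positives are counted on the detected points. The function names, the `tolerance` parameter, and the example series are hypothetical.

```python
from typing import List, Tuple


def segments(flags: List[bool]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for each run of True values."""
    runs, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(flags)))
    return runs


def score(labelled: List[bool], detected: List[bool],
          tolerance: float = 0.5) -> Tuple[int, int, int]:
    """Count (tp, fp, fn) datapoints under the assumed reading of the rules."""
    tp = fp = fn = 0
    for start, end in segments(detected):
        overlap = sum(labelled[start:end])   # detected points that are labelled
        extra = (end - start) - overlap      # detected points outside the labels
        if overlap == 0:
            fp += extra                      # no overlap: every point is a false positive
            continue
        allowed = int(tolerance * overlap)   # extra points tolerated (assumed 50% of overlap)
        tp += overlap + min(extra, allowed)
        fp += max(0, extra - allowed)        # only points beyond the tolerance are penalised
    for start, end in segments(labelled):
        if not any(detected[start:end]):     # labelled anomaly never detected
            fn += end - start
    return tp, fp, fn


# Hypothetical 22-point series.
labelled = [False] * 22
detected = [False] * 22
for i in range(2, 6):
    labelled[i] = True      # labelled anomaly, detected below
for i in (14, 15):
    labelled[i] = True      # labelled anomaly, never detected
for i in range(3, 10):
    detected[i] = True      # overlaps 3 labelled points, 4 extra points
for i in (18, 19):
    detected[i] = True      # no overlap with any label

tp, fp, fn = score(labelled, detected)
print(tp, fp, fn)
```

In this example the first detection overlaps three labelled points, so one extra point is tolerated and the remaining three are false positives; the point at index 2 is not a false negative because its labelled anomaly was detected.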

Note that, for metrics that are very active, our algorithm is designed to ignore deviations that last less than 8 minutes³ in order to reduce alert fatigue, so very short anomalies will not be detected.
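The effect of such a minimum duration can be illustrated with a small Python sketch that discards anomalous runs shorter than a threshold. This is not Eyer's implementation: it assumes one sample per minute, so that 8 minutes correspond to 8 consecutive points, it ignores the condition that the rule applies only to very active metrics, and the function name `drop_short_runs` is hypothetical.

```python
from typing import List


def drop_short_runs(flags: List[bool], min_len: int = 8) -> List[bool]:
    """Return a copy of flags where runs of True shorter than min_len are cleared."""
    out = list(flags)
    i, n = 0, len(flags)
    while i < n:
        if not flags[i]:
            i += 1
            continue
        j = i
        while j < n and flags[j]:   # advance to the end of the current run
            j += 1
        if j - i < min_len:         # run too short: treat it as normal behaviour
            out[i:j] = [False] * (j - i)
        i = j
    return out


# Hypothetical series: a 5-point deviation followed by an 8-point deviation.
raw = [False] * 2 + [True] * 5 + [False] * 3 + [True] * 8 + [False] * 2
filtered = drop_short_runs(raw)
print(sum(raw), sum(filtered))
```

Under these assumptions a labelled anomaly that lasts only a few minutes can never be matched by a detection, which puts a ceiling slightly below 1 on the achievable recall.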

The Server Machine Dataset (SMD) contains data coming from several server machines. Each machine is translated to a node in Eyer. For each machine there is a multivariate time series (each variable is treated as a metric in Eyer). In the SMD the data are split in two: a train part and a test part. The train part is claimed to be anomaly-free, while for the test part anomalies are labelled. Our alerting mechanism packs the anomalies on all the metrics of a node into a single alert, therefore we test our algorithm in a multivariate way: an anomaly on a single metric is considered to be an anomaly for the full node.
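The node-level view described above can be sketched in Python as a logical OR across the metrics of a node at each timestamp. This is only an illustration of the evaluation convention, not of Eyer's alerting code; the function name `node_flags` and the metric names are hypothetical.

```python
from typing import Dict, List


def node_flags(metric_flags: Dict[str, List[bool]]) -> List[bool]:
    """A timestamp is anomalous for the node if any of its metrics is anomalous."""
    series = list(metric_flags.values())
    return [any(column) for column in zip(*series)]


# Hypothetical per-metric anomaly flags for one node, aligned on the same timestamps.
per_metric = {
    "metric_a": [False, True, False, False],
    "metric_b": [False, False, False, True],
    "metric_c": [False, False, False, False],
}
print(node_flags(per_metric))
```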

Our algorithm is a multilevel algorithm: it classifies each detected anomaly as low, medium, or high criticality. We therefore present the results both when the anomalies of all criticalities are considered and when only the anomalies with higher criticality are considered. When going from low to high criticality, the recall usually goes down and the precision increases.

For the scope of this note we are just going to focus on machine-1-1 of the SMD.

Results on the original labelling

We enable our machine learning at the end of the train part of the dataset.

Anomalies of all criticalities:

Recall: 0.997

Precision: 0.13

F1: 0.23

Only critical anomalies:

Recall: 0.997

Precision: 0.16

F1: 0.27

This means that our ML was able to discover almost all the labelled anomalies, with the exception of the very short ones, which our algorithm is not designed to detect.

The precision is low, so there were many false positives, and this dragged down the F1 score.

Our algorithm is designed to become more precise with time, so here are the results when excluding the anomalies detected in the first week after we enabled the anomaly detection:

Anomalies of all criticalities:

Recall: 0.999

Precision: 0.29

F1: 0.45

Only critical anomalies:

Recall: 0.997

Precision: 0.34

F1: 0.51

Relabelling

To illustrate how we have relabelled the data, we are going to focus on a single variable of machine-1-1.

These are the training data for this variable:

[Figure: training data for the selected variable of machine-1-1]

We could already challenge the statement that these data are anomaly-free, since we can see the spike at the beginning of the dataset (a bit before 5000) and the increase in value in the second half. Strictly speaking, these two changes can be considered anomalies if we do not have the context of how a user would define an anomaly. Notice that the metric never went above about 0.5, and most of the time was well below that value.

These are the test data, for which anomalies are labelled:

Let’s zoom in and circle the labelled anomalies in red:

In the following image you can see the anomalies that we relabelled (in yellow):

The first relabelled anomaly was added because, for this metric, the values in the second half of the day usually do not exceed those in the first half of the day by so much; the other three were added because the peaks have a higher value than previously observed.

The last two anomalies were identified because, even though they had some similarities with the anomalies seen previously, they had a much more oscillatory behaviour.

Similar relabelling was done for all the variables of machine-1-1.

Results for the relabelled data

Anomalies of all criticalities:

Recall: 1

Precision: 0.50

F1: 0.67

Only critical anomalies:

Recall: 0.999

Precision: 0.52

F1: 0.68

And if we consider the performance when excluding the first week after the anomaly detection was enabled:

Anomalies of all criticalities:

Recall: 1

Precision: 0.68

F1: 0.81

Only critical anomalies:

Recall: 0.999

Precision: 0.76

F1: 0.86

Conclusion

Our algorithm has proven to have high recall on both the originally labelled data and the relabelled data. Once we relabelled the data, the precision and the F1 score also increased substantially. This is of course a partial test, and a wider range of data and labelled data is necessary in order to further assess the precision of our algorithm. The very good result on this data confirms that our algorithm performs well on data with a high degree of cyclicality, such as the metric examined in the relabelling section, which exhibited a visible daily cyclical behaviour. The method of comparison presented here is based on the criticality of the anomalies, but in Eyer we also assign a criticality to the alert. The criticality of the alert is based on how many critical anomalies it contains. An alert can contain multiple nodes, whereas this was a single-node analysis. We are going to assess the F1 score based on alert criticality in future multi-node experiments.

Bibliography

1 - Server Machine Dataset: https://github.com/NetManAIOps/OmniAnomaly/tree/7fb0e0acf89ea49908896bcc9f9e80fcfff6baf4/ServerMachineDataset

2 - F-score (Wikipedia): https://en.wikipedia.org/wiki/F-score

3 - Eyer documentation: High frequency high activity anomaly detection