
Abstract

We have been testing the core anomaly detection algorithm of Eyer by calculating the F1 score, recall, and precision on a small part of the Server Machine Dataset, a labelled open source dataset available online. We first tested using the dataset as is, and then after relabelling it. The relabelling was done following the observation that some events that, by looking at the data, seemed anomalous were not labelled as such. We discuss this observation in the section dedicated to relabelling. This is a test on a reduced set of data; in the future we plan to release more notes on testing with larger datasets. Our algorithm scores high in recall for both cases, the original labelling and the relabelled one, which means it is good at capturing all the labelled anomalies. For the original labelling the precision and, consequently, the F1 are low, but they become high in the relabelled case. For the relabelled case we reach an F1 of 0.81 (0.86 if we consider only the critical anomalies) after the first 4 weeks of data collection have passed.

Introduction

We are continuously testing our algorithms to improve their quality. This note is about a single test that we have been running using a labelled open source dataset that is sometimes used in the literature to assess the performance of AIOps algorithms for monitoring IT systems: the Server Machine Dataset (SMD) [1]. The performance metrics that we focus on are the F1 score [2], recall, and precision.

The F1 score gives a measure of the trade-off between how good your algorithm is at identifying all real anomalies (recall) and how precise it is (precision). It is defined as:

F1 = 2 * R * P / (R + P)

Where R and P are recall and precision, defined as:

R = truePositives / (truePositives + falseNegatives)

P = truePositives / (truePositives + falsePositives)
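
As a quick reference, here is a minimal Python sketch of these three formulas applied to raw counts (the function name and the example numbers are ours, purely illustrative):

    def precision_recall_f1(true_positives, false_positives, false_negatives):
        """Compute precision, recall and F1 from raw counts."""
        recall = true_positives / (true_positives + false_negatives)
        precision = true_positives / (true_positives + false_positives)
        f1 = 2 * recall * precision / (recall + precision)
        return precision, recall, f1

    # Example: 100 true positives, 20 false positives, 5 false negatives
    print(precision_recall_f1(100, 20, 5))  # -> (0.833..., 0.952..., 0.888...)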

This dataset is claimed to be anomaly free for the first half (which is the training half) and labelled for the second half. Labelling is always a difficult task and is usually done with a specific use case in mind, therefore it might appear partial when taken out of its specific context. Also, the statement that the first half of the dataset is anomaly free can in general be challenged. Eyer is designed to be as generic as possible, therefore the type of labelling that one needs in order to assess the performance of our algorithm is one that labels all the changes that happen in the dataset with respect to the previous behaviour. The SMD does not seem to be labelled with this in mind. Finally, since labelling is a lengthy activity, one always has to keep in mind that open datasets are often only partially labelled, since humans can easily overlook anomalies when labelling.
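
For reference, this is roughly how the data for machine-1-1 can be loaded. The directory layout (train/, test/ and test_label/ folders with comma-separated .txt files) is the one used in the repository linked as reference [1]; treat the exact paths as an assumption if you use a different mirror of the dataset.

    import numpy as np

    BASE = "ServerMachineDataset"  # local copy of the dataset folder from reference [1]

    # First half: training data, claimed to be anomaly free
    # (one row per timestamp, one column per metric)
    train = np.loadtxt(f"{BASE}/train/machine-1-1.txt", delimiter=",")

    # Second half: test data plus a 0/1 anomaly label per timestamp
    test = np.loadtxt(f"{BASE}/test/machine-1-1.txt", delimiter=",")
    labels = np.loadtxt(f"{BASE}/test_label/machine-1-1.txt", delimiter=",").astype(int)

    print(train.shape, test.shape, labels.shape)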

Our approach is to first use the original labelling to see how our algorithm performs, and then to relabel the data and measure the performance again. We dedicate a section below to explaining how the relabelling was done.

Methodology

Before presenting the results we have to introduce how we define true positives, false positives, and false negatives. It might seem trivial to define these, but in reality an anomaly is very often not a single data point but a series of data points. In the case of IT operations, anomalies are often deviations from the usual behaviour that persist for some time. When labelling our data manually we might be off from the actual starting and final points of these deviations, and sometimes it might be impossible to define a starting and finishing point precisely. For example: does an anomaly start when a deviation is well established, or should we also include the oscillations that preceded it? Does an anomaly finish when the value of a metric is back to a normal value, or when it is on its way back to the normal value but has not reached it yet?

To capture the variability of labelling we have established these few rules:

  • A labelled anomaly (a series of datapoints which were labelled as anomalous) that overlaps with a detected anomaly (a series of datapoints flagged as anomalous by our algorithm) is considered discovered: none of the datapoints in the labelled anomaly are counted as false negatives, even if they do not correspond to datapoints flagged as anomalous by our algorithm.

  • All the datapoints of a detected anomaly are counted as true positives if the detected anomaly does not extend beyond its overlap with the labelled anomalies by more than 50%.

  • If a detected anomaly extends beyond its overlap with the labelled anomalies by more than 50%, the additional points are counted as false positives.

  • All the points of detected anomalies which do not overlap at all with labelled anomalies are counted as false positives.

  • All the points of a labelled anomaly which does not overlap with any detected anomaly, and is therefore not detected, are counted as false negatives.
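
To make these rules concrete, here is a minimal Python sketch of how they can be applied when labelled and detected anomalies are represented as (start, end) intervals of datapoint indices. It reflects our reading of the rules above (in particular of the 50% allowance) and is not the production implementation:

    def overlap(a, b):
        """Number of datapoints shared by two (start, end) intervals, ends inclusive."""
        return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

    def count_tp_fp_fn(detected, labelled):
        tp = fp = fn = 0

        for d in detected:
            length = d[1] - d[0] + 1
            shared = sum(overlap(d, l) for l in labelled)
            if shared == 0:
                fp += length                      # rule 4: no labelled counterpart
            elif length <= 1.5 * shared:
                tp += length                      # rule 2: within the 50% allowance
            else:
                allowed = int(1.5 * shared)       # rule 3: points beyond the allowance
                tp += allowed
                fp += length - allowed

        for l in labelled:
            if all(overlap(d, l) == 0 for d in detected):
                fn += l[1] - l[0] + 1             # rule 5: labelled anomaly never detected
            # rule 1: an overlapped labelled anomaly contributes no false negatives

        return tp, fp, fn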

One thing to notice is that our algorithm, for metrics that are very active, is designed to ignore deviations that last less than 8 minutes [3], to reduce alert fatigue, so very short anomalies will not be detected.
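
In the evaluation this simply means dropping detections shorter than that threshold before counting. A small sketch, assuming one datapoint per minute (our understanding of the SMD sampling rate):

    def drop_short_detections(detected, min_minutes=8, minutes_per_point=1):
        """Discard detected anomalies shorter than the minimum duration."""
        return [d for d in detected
                if (d[1] - d[0] + 1) * minutes_per_point >= min_minutes]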

The SMD contains several metrics for each machine, and each machine is translated into a node in Eyer. Our alerting mechanism packs the anomalies on all the metrics of a node into a single alert, therefore we test our algorithm in a multivariate way: an anomaly on a single metric is considered to be an anomaly for the full node.
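
In terms of the interval representation used above, this amounts to taking, per node, the union of the anomalous intervals found on its individual metrics. A minimal sketch of that merge:

    def merge_node_anomalies(per_metric_anomalies):
        """Union of anomalous intervals across all metrics of a node:
        an anomaly on any single metric makes the whole node anomalous."""
        intervals = sorted(i for metric in per_metric_anomalies for i in metric)
        merged = []
        for start, end in intervals:
            if merged and start <= merged[-1][1] + 1:   # touching or overlapping intervals
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        return merged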

Our algorithm is a multilevel algorithm: it classifies each detected anomaly as low, medium, or high criticality. We are therefore going to present the results both when anomalies of all criticalities are considered and when we focus only on the anomalies with the highest criticality. When going from low to high criticality the recall usually goes down and the precision increases.
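
For the two variants reported below we simply evaluate twice: once on all detected anomalies and once keeping only the critical ones. A hypothetical sketch (a criticality field attached to each detection is our assumption for illustration, not a documented interface):

    def filter_by_criticality(detections, keep=("high",)):
        """Keep only detections whose criticality is in `keep`;
        pass keep=("low", "medium", "high") to evaluate on all anomalies."""
        return [d for d in detections if d["criticality"] in keep]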

For the scope of this note we are just going to focus on machine-1-1 of the SMD.

Results on the original labelling

We enable our machine learning at the end of the train part of the dataset.

Anomalies of all criticalities:

Recall: 0.997

Precision: 0.13

F1: 0.23

Only critical anomalies:

Recall: 0.997

Precision: 0.16

F1: 0.27

This means that our ML was able to discover almost all the labelled anomalies, with the exception of the very short ones, which our algorithm is not designed to detect.

The precision is low, so there have been many false positives, and this dragged down the F1.

Our algorithm is designed to become more precise with time, so here are the results excluding the anomalies detected in the first week after we enabled the anomaly detection:

Anomalies of all criticalities:

Recall: 0.999

Precision: 0.29

F1: 0.45

Only critical anomalies:

Recall: 0.997

Precision: 0.34

F1: 0.51
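
The one-week exclusion is simply a cut on where each detected anomaly starts. A minimal sketch, assuming timestamps are datapoint indices at one point per minute and that enable_index marks the end of the training part (both names are ours):

    def drop_warmup_detections(detected, enable_index, warmup_points=7 * 24 * 60):
        """Discard detections that start within the first week after the anomaly
        detection was enabled (assumes one datapoint per minute)."""
        return [d for d in detected if d[0] >= enable_index + warmup_points]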

Relabelling

To illustrate how we have relabelled the data, we are going to focus on a single variable of machine-1-1.

These are the training data for this variable:

[Figure: training data for the selected variable of machine-1-1]

We could already challenge the statement that these data are anomaly free, since we can see the spike at the beginning of the dataset (a bit before 5000) and the increase in value in the second half. These two changes can, strictly speaking, be considered anomalies if we do not have the context of how a user would define an anomaly. Notice that the metric never went above about 0.5, and most of the time was way below that value.

These are the test data, for which labels are provided:

[Figure: test (labelled) data for the same variable]

Let's zoom in and circle the labelled anomalies in red:

[Figures: original labelling, parts 1 and 2, with labelled anomalies circled in red]

In the following image you can see the anomalies that we relabelled (in yellow):

[Figure: relabelling, part 1, with relabelled anomalies marked in yellow]

The first relabelled anomaly was added because, for this metric, the values in the second half of the day do not usually exceed those in the first half of the day by so much; the other three were added because the peaks have a higher value than previously observed.

[Figure: relabelling, part 2, with relabelled anomalies marked in yellow]

The last two anomalies were identified because, even though they had some similarities with anomalies seen previously, they showed a much more oscillatory behaviour.

Similar relabelling was done for all the variables of machine-1-1.

Results for the relabelled data

Anomalies of all criticalities:

Recall: 1

Precision: 0.50

F1: 0.67

Only critical anomalies:

Recall: 0.999

Precision: 0.52

F1: 0.68

And if we consider the performance after the anomaly detection has been enabled for one week:

Anomalies of all criticalities:

Recall: 1

Precision: 0.68

F1: 0.81

Only critical anomalies:

Recall: 0.999

Precision: 0.76

F1: 0.86

Conclusion

Our algorithm has proven to have high recall on both the originally labelled data and the relabelled data. Once we relabelled the data, the precision and the F1 also increased considerably. This is of course a partial test, and a wider range of data and labels is necessary in order to further assess the precision of our algorithm. The very good result of our algorithm on these data is a confirmation that our algorithm performs well on data with a high degree of cyclicality, such as this metric, which exhibited a visible daily cyclical behaviour. The method of comparison presented here is based on the criticality of the anomalies, but in Eyer we also assign a criticality to the alert. The criticality of the alert is based on how many critical anomalies it contains. An alert can contain multiple nodes, and this was a single-node analysis. We are going to assess the F1 based on alert criticality in future multi-node experiments.

Bibliography

[1] Server Machine Dataset: https://github.com/NetManAIOps/OmniAnomaly/tree/7fb0e0acf89ea49908896bcc9f9e80fcfff6baf4/ServerMachineDataset

[2] F1 score: https://en.wikipedia.org/wiki/F-score

[3] Eyer documentation: High frequency high activity anomaly detection
