Abstract

We have been evaluating Eyer's core anomaly detection algorithm by calculating the F1 score, recall, and precision on a subset of the Server Machine Dataset, a publicly available labeled dataset. Initially, we tested the dataset as is, then proceeded to relabel it. The relabeling was prompted by the observation that some events, which appeared anomalous based on the data, were not labeled as such. We discuss this observation further in the Relabeling section. This test was conducted on a small dataset, and we plan to release additional updates with larger datasets in the future. Our algorithm consistently shows high recall in both the original and relabeled cases, indicating its effectiveness in identifying nearly all labeled anomalies. For the original dataset, precision and, consequently, the F1 score were low, but they improved significantly after relabeling. In the relabeled dataset, we achieved an F1 score of 0.81 (0.86 when considering only critical anomalies) after four weeks of data collection.

Introduction

We are continuously testing our algorithms to enhance their quality. This note focuses on a specific test using a labeled open-source dataset, the Server Machine Dataset (SMD)¹, which is sometimes used in the literature to evaluate AIOps algorithms for monitoring IT systems. The performance metrics we focus on are the F1 score², recall, and precision.

The F1 score reflects the balance between how well the algorithm identifies all actual anomalies (recall) and how many alerts are genuine (precision). It is calculated as:

...
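For reference, the standard definition of F1 is the harmonic mean of precision and recall. The following minimal Python sketch computes it; the function name `f1_score` is ours for illustration and is not part of Eyer. It can be checked against the precision and recall figures reported in the results sections of this note.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        # Convention assumed here: F1 is 0 when both inputs are 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs reported in this note.
print(round(f1_score(0.34, 0.997), 2))  # 0.51
print(round(f1_score(0.76, 0.999), 2))  # 0.86
```

Because F1 is a harmonic mean, it stays close to the lower of the two inputs, which is why a recall near 1 cannot compensate for low precision.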

The dataset is said to be anomaly-free in the first half (used for training) and labeled for anomalies in the second half. Labeling is often context-specific and can be incomplete when taken out of its original context. Additionally, the claim that the first half of the dataset is anomaly-free can be questioned. Eyer is designed to be as generic as possible, without hyper-tuning for a specific use case, meaning that our evaluation requires labeling all changes in the dataset compared to prior behavior. However, the labeling in the SMD does not seem to follow this approach. Furthermore, as labeling is a time-consuming task, open datasets are often only partially labeled, and human errors can lead to missed anomalies.

Our approach involves first testing the algorithm using the original labels, then relabeling the dataset and measuring performance again. A section is dedicated to explaining how the relabeling was conducted. In the future, we will expand this exercise to include more data from the SMD and other datasets, both open-source and internally generated.

Methodology

Before presenting the results, we need to define true positives, false positives, and false negatives. This might seem trivial, but in practice it is more complex: an anomaly is very often not a single data point but a series of data points. In IT operations, anomalies are often deviations from the usual behavior that persist for some time. When labeling data manually, it can be tricky to mark the exact point where a deviation started or ended, and sometimes it is impossible to define those points precisely. For example, does an anomaly start when a deviation is well established, or should the oscillations that preceded it be included? Does an anomaly end when the metric is back to a normal value, or when it is on its way back but has not reached it yet?

In the following, a labeled anomaly refers to a series of datapoints marked as anomalous and used to test the algorithm. A detected anomaly is a series of datapoints flagged as anomalous by our algorithm. To account for the variability in labeling, we have established a few key rules:

A labeled anomaly that overlaps with one or more detected anomalies is considered detected. None of the datapoints within that labeled anomaly will be counted as false negatives, even if they don't exactly match the detected anomaly.
If the total duration of a detected anomaly is at most 1.5 times its overlap with labeled anomalies (that is, it lasts no more than 50% longer than the overlap), all its points are counted as true positives.
If the total duration of a detected anomaly is more than 1.5 times its overlap with labeled anomalies, any additional points are counted as false positives.
All the points of detected anomalies that do not overlap with labeled anomalies are considered false positives.
All the points of a labeled anomaly that does not overlap with any detected anomaly, and is therefore not detected, are false negatives.
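The rules above can be sketched in Python as follows. This is an illustrative reading, not Eyer's actual evaluation code. It makes several assumptions: anomalies are half-open index intervals `(start, end)`, labeled intervals do not overlap each other, and the "additional points" of an overlong detection are the points beyond 1.5 times the overlap. The names `count_points` and `tolerance` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), in sample indices


def overlap_len(a: Interval, b: Interval) -> int:
    """Number of points shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def count_points(labeled: List[Interval], detected: List[Interval],
                 tolerance: float = 1.5) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) as point counts.

    Assumes labeled intervals are mutually non-overlapping.
    """
    tp = fp = fn = 0
    # A labeled anomaly with no overlapping detection is entirely false negative.
    for lab in labeled:
        if not any(overlap_len(lab, det) > 0 for det in detected):
            fn += lab[1] - lab[0]
    for det in detected:
        duration = det[1] - det[0]
        overlap = sum(overlap_len(det, lab) for lab in labeled)
        if overlap == 0:
            fp += duration              # no overlap at all: all false positives
        elif duration <= tolerance * overlap:
            tp += duration              # within the allowance: all true positives
        else:
            allowed = int(tolerance * overlap)  # assumed reading of "additional points"
            tp += allowed
            fp += duration - allowed
    return tp, fp, fn


labeled = [(10, 20), (50, 60), (100, 110)]
detected = [(12, 22), (30, 35), (100, 140)]
print(count_points(labeled, detected))  # (25, 30, 10)
```

In this example the detection `(12, 22)` is fully credited, `(30, 35)` is a pure false positive, `(100, 140)` is credited for 15 points and penalized for 25, and the undetected label `(50, 60)` contributes 10 false negatives.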

One thing to note is that, for metrics that are very active, our algorithm is designed to ignore deviations that last less than 8 minutes³. This is done to reduce alert fatigue, and it means that very short anomalies will not be detected.
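The effect of such a minimum-duration rule can be illustrated with a small sketch. It is not Eyer's implementation: how the algorithm decides that a metric is "very active" is not described here, so the sketch only shows the filtering step, assuming deviations are half-open intervals expressed in minutes. The names `drop_short_deviations` and `min_minutes` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), in minutes (assumed unit)


def drop_short_deviations(intervals: List[Interval],
                          min_minutes: int = 8) -> List[Interval]:
    """Keep only deviations lasting at least `min_minutes`."""
    return [(start, end) for start, end in intervals if end - start >= min_minutes]


deviations = [(0, 5), (10, 18), (20, 40)]
print(drop_short_deviations(deviations))  # [(10, 18), (20, 40)]
```

A labeled anomaly shorter than the threshold on such a metric would therefore count as a false negative under the rules above.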

The Server Machine Dataset (SMD) consists of data from multiple server machines, each represented as a node in Eyer. For each machine, there is a multivariate time series, with each variable treated as a metric in Eyer. The SMD data is divided into two parts: the training part, which is claimed to be anomaly-free, and the test part, where anomalies are labeled.
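When working with the test labels, it is convenient to turn a per-datapoint label sequence into anomaly intervals. The sketch below assumes the labels are available as one 0/1 flag per datapoint; the function name `labels_to_intervals` is hypothetical and the half-open interval convention is our choice.

```python
from typing import List, Sequence, Tuple


def labels_to_intervals(labels: Sequence[int]) -> List[Tuple[int, int]]:
    """Group consecutive anomalous flags into half-open (start, end) intervals."""
    intervals = []
    start = None
    for i, flag in enumerate(labels):
        if flag and start is None:
            start = i                      # an anomalous run begins
        elif not flag and start is not None:
            intervals.append((start, i))   # the run ended at the previous point
            start = None
    if start is not None:                  # run continues to the end of the series
        intervals.append((start, len(labels)))
    return intervals


print(labels_to_intervals([0, 1, 1, 0, 0, 1]))  # [(1, 3), (5, 6)]
```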

In our alerting mechanism, we group anomalies across all metrics for a node into a single alert. This means we test our algorithm in a multivariate context, where an anomaly in any single metric is treated as an anomaly for the entire node.
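A minimal sketch of this node-level view: a datapoint is anomalous for the node if at least one metric is anomalous at that point. The metric names and the function `node_level_flags` are hypothetical, and the sketch assumes all metrics are sampled at the same timestamps.

```python
from typing import Dict, List


def node_level_flags(metric_flags: Dict[str, List[bool]]) -> List[bool]:
    """Combine per-metric anomaly flags into one flag per datapoint for the node."""
    series = list(metric_flags.values())
    if not series:
        return []
    # zip(*series) yields, for each datapoint, the flags of all metrics.
    return [any(point) for point in zip(*series)]


flags = {
    "cpu": [False, True, False, False],
    "memory": [False, False, False, True],
}
print(node_level_flags(flags))  # [False, True, False, True]
```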

Our algorithm is a multilevel algorithm: it classifies each detected anomaly as low, medium, or high criticality. We will present the results in two ways: one considering all anomalies regardless of their criticality, and another focusing only on the higher-criticality anomalies. Compared with considering all levels of criticality, the test with high criticality only is characterized by lower recall but higher precision.

For the scope of this note, we focus only on machine-1-1 of the SMD.

Results on the original labeling

We enable our machine learning at the end of the training part of the dataset.

...

Recall: 0.997

Precision: 0.34

F1: 0.51

...

Relabeling

To illustrate how we relabeled the data, we focus on a single variable of machine-1-1.

...

The following image shows the anomalies that we relabeled (in yellow).

The first relabeled anomaly was marked because, for this metric, the values in the second half of the day usually do not exceed those in the first half by so much. The other three were marked because the peaks have a higher value than previously observed.

...

The last two anomalies were identified because, even though they had some similarities with the anomalies seen previously, they showed much more oscillatory behavior.

Similar relabeling was done for all the variables of machine-1-1.

Results for the relabeled data

Anomalies of all criticalities:

...

Recall: 0.999

Precision: 0.76

F1: 0.86

Conclusion

Eyer's algorithm consistently achieves high recall, meaning it captures nearly all labeled anomalies in both the original and relabeled datasets. While precision and F1 scores were lower with the original labels, they improved significantly after relabeling. This is a partial test, and a wider range of labeled data is necessary to further assess the precision of Eyer's algorithm. The strong performance of our algorithm on this data indicates that it handles datasets with a high degree of cyclicality well, as seen in the considered metrics, which displayed clear daily patterns. The comparison presented here is based on the criticality of individual anomalies, but Eyer also assigns a criticality to each alert, based on how many critical anomalies it contains. An alert can span multiple nodes, whereas the analysis in this note covers a single node. In future experiments with multi-node datasets, we will assess the F1 score based on alert criticality.

Bibliography

1 - Server Machine Dataset: https://github.com/NetManAIOps/OmniAnomaly/tree/7fb0e0acf89ea49908896bcc9f9e80fcfff6baf4/ServerMachineDataset

...
