Abstract
We have been evaluating Eyer's core anomaly detection algorithm by calculating the F1 score, recall, and precision on a small subset of the Server Machine Dataset, a publicly available labeled open-source dataset. Initially, we tested the dataset as is, then proceeded to relabel it. The relabeling was prompted by the observation that some events, which appeared anomalous based on the data, were not labeled as such. We'll discuss this observation further in the section dedicated to relabeling. This test was conducted on a smaller dataset, and we plan to release more notes on testing with larger datasets in the future. Our algorithm consistently shows high recall in both cases, the original and the relabeled one, indicating its effectiveness in identifying all labeled anomalies. For the original labeling, precision and, consequently, the F1 score were low, but they improved significantly after relabeling. In the relabeled dataset, we achieved an F1 score of 0.81 (0.86 when considering only the critical anomalies) after the first four weeks of data collection.
Introduction
We are continuously testing our algorithms to enhance their quality. This note focuses on a specific test that we have been running using a labeled open-source dataset, the Server Machine Dataset (SMD) 1, which is sometimes used in the literature to assess the performance of AIOps algorithms for monitoring IT systems. The performance metrics that we focus on are the F1 score 2, recall, and precision.
The F1 score reflects the balance between how well the algorithm identifies all actual anomalies (recall) and how many alerts are genuine (precision). It is calculated as:
F1 = 2*R*P / (R+P)
where R and P are recall and precision, which are defined as:
R = truePositives/(truePositives + falseNegatives)
P = truePositives/(truePositives + falsePositives)
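To make these definitions concrete, here is a minimal Python sketch (not part of Eyer) that computes precision, recall, and the F1 score from counts of true positives, false positives, and false negatives. The example counts are illustrative only, chosen so the resulting numbers are close to those reported later in this note.

def precision(tp: int, fp: int) -> float:
    # P = truePositives / (truePositives + falsePositives)
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp: int, fn: int) -> float:
    # R = truePositives / (truePositives + falseNegatives)
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

def f1_score(p: float, r: float) -> float:
    # F1 = 2*R*P / (R + P), the harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

# Illustrative counts only, not the actual counts from our test.
p = precision(tp=340, fp=660)   # 0.34
r = recall(tp=340, fn=1)        # ~0.997
print(round(p, 2), round(r, 3), round(f1_score(p, r), 2))  # 0.34 0.997 0.51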
The dataset is said to be anomaly-free in the first half (which is used for training) and labeled for anomalies in the second half. Labeling is always a difficult task and is usually done with a specific use case in mind; it is often context-specific and can appear incomplete when taken out of its original context. Additionally, the claim that the first half of the dataset is anomaly-free can be questioned. Eyer is designed to be as generic as possible, not hyper-tuned to a specific use case, meaning that our evaluation requires labeling all changes in the dataset compared to prior behavior. However, the labeling in the SMD does not seem to follow this approach. Furthermore, since labeling is a time-consuming task, open datasets are often only partially labeled, as human error can easily lead to missed anomalies.
Our approach is to first use the original labeling to see how our algorithm performs, then relabel the dataset and measure the performance again. A section below is dedicated to explaining how the relabeling was done. In the future, we will expand this exercise to include more data from the SMD and other datasets, both open-source and internally generated.
Methodology
Before presenting the results, we have to introduce how we define true positives, false positives, and false negatives. It might seem trivial to define these, but in reality an anomaly is very often not a single data point but a series of data points. In the case of IT operations, anomalies are often deviations from the usual behavior that persist for some time. When labeling our data manually, we might be off on the actual starting and ending points of said deviations, and sometimes it is impossible to define them precisely. For example: does an anomaly start when a deviation is well established, or should we also include the oscillations that preceded it? Does an anomaly end when the value of a metric is back to a normal value, or when it is on its way back to normal but has not reached it yet? It can be tricky to label the exact point where a deviation started or ended.
In the following, a labeled anomaly refers to a series of datapoints marked as anomalous and used to test the algorithm. A detected anomaly is a series of datapoints flagged as anomalous by our algorithm. To account for the variability in labeling, we've established a few key rules (a short code sketch illustrating these rules follows the list):
A labeled anomaly that overlaps with one or more detected anomalies is considered detected. None of the datapoints within that labeled anomaly will be counted as false negatives, even if they do not correspond to datapoints flagged as anomalous by our algorithm.
If a detected anomaly overlaps with a labeled one for more than 50% of its duration, all its points are counted as true positives, even if they don't exactly match the labeled anomaly.
If the total duration of a detected anomaly is more than 1.5 times its overlap with labeled anomalies, the additional points are counted as false positives.
All the points of detected anomalies which do not overlap with labeled anomalies are considered false positives.
All the points of a labeled anomaly which does not overlap with any detected anomaly, and is therefore not detected, are false negatives.
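To make these rules concrete, below is a minimal Python sketch of one possible reading of them. It is not Eyer's actual scoring code, and the interval representation and helper names are ours; anomalies are represented as half-open index ranges over the datapoints.

from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) datapoint indices, end exclusive

def overlap_len(a: Interval, b: Interval) -> int:
    # Number of datapoints shared by two intervals.
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def score_intervals(labeled: List[Interval], detected: List[Interval]):
    tp = fp = fn = 0
    for d in detected:
        duration = d[1] - d[0]
        overlap = sum(overlap_len(d, l) for l in labeled)
        if overlap > 0.5 * duration and duration <= 1.5 * overlap:
            # Mostly overlapping and not much longer than the overlap:
            # all points of the detected anomaly count as true positives.
            tp += duration
        else:
            # Only overlapping points are true positives; the rest are false positives.
            tp += overlap
            fp += duration - overlap
    for l in labeled:
        # A labeled anomaly that overlaps any detection is "detected", so none of
        # its points are false negatives; otherwise all of them are.
        if not any(overlap_len(d, l) > 0 for d in detected):
            fn += l[1] - l[0]
    return tp, fp, fn

print(score_intervals(labeled=[(10, 20)], detected=[(12, 22)]))  # (10, 0, 0): full credit for this detection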
One thing to notice is that, for metrics that are very active, our algorithm is designed to ignore deviations that last less than 8 minutes 3. This is done to reduce alert fatigue, and it means that very short anomalies will not be detected.
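As an illustration only (the actual thresholding inside Eyer is not public), a minimum-duration filter of this kind could look like:

MIN_DURATION_MINUTES = 8  # deviations shorter than this are ignored on very active metrics

def keep_deviation(start_minute: int, end_minute: int) -> bool:
    # Suppress very short deviations to reduce alert fatigue.
    return (end_minute - start_minute) >= MIN_DURATION_MINUTES

print(keep_deviation(0, 5))   # False: too short to be reported
print(keep_deviation(0, 12))  # True: long enough to be reported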
The Server Machine Dataset (SMD) consists of data coming from multiple server machines, each represented as a node in Eyer. For each machine there is multivariate time series data, with each variable treated as a metric in Eyer. The SMD data is divided into two parts: the training part, which is claimed to be anomaly-free, and the test part, where anomalies are labeled.
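For reference, here is a minimal sketch of loading one machine with NumPy, assuming the usual layout of the SMD repository (train/, test/, and test_label/ folders with comma-separated values per timestamp); paths may differ in your checkout.

import numpy as np

# Paths assume the ServerMachineDataset folder from the OmniAnomaly repository.
train = np.loadtxt("ServerMachineDataset/train/machine-1-1.txt", delimiter=",")
test = np.loadtxt("ServerMachineDataset/test/machine-1-1.txt", delimiter=",")
labels = np.loadtxt("ServerMachineDataset/test_label/machine-1-1.txt", delimiter=",")

print(train.shape)   # (timestamps, metrics): the anomaly-free training half
print(test.shape)    # (timestamps, metrics): the half with labeled anomalies
print(labels.shape)  # one 0/1 anomaly label per test timestamp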
In our alerting mechanism, we group anomalies across all the metrics of a node into a single alert. This means we will test our algorithm in a multivariate context, where an anomaly in any single metric is treated as an anomaly for the entire node.
Our algorithm is a multilevel algorithm: it classifies each detected anomaly as low, medium, or high criticality. We will present the results in two ways: one considering all anomalies regardless of their criticality, and another focusing only on the higher-criticality anomalies. The test with high criticality only is typically characterised by lower recall but higher precision compared to considering all levels of criticality.
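As a hedged illustration of what testing in a multivariate way means here (this is not Eyer's alerting code, and the function and level names are ours), per-metric anomalies can be collapsed into node-level flags, optionally keeping only the more critical ones:

from typing import Dict, List, Optional

LEVELS = {"low": 1, "medium": 2, "high": 3}  # anomaly criticality levels

def node_level_anomalies(per_metric: Dict[str, List[Optional[str]]],
                         min_level: str = "low") -> List[bool]:
    # A timestamp is anomalous for the node if any metric has an anomaly
    # at or above the requested criticality level (None means no anomaly).
    n = len(next(iter(per_metric.values())))
    threshold = LEVELS[min_level]
    return [
        any(series[t] is not None and LEVELS[series[t]] >= threshold
            for series in per_metric.values())
        for t in range(n)
    ]

# Example: two metrics of one node, four timestamps.
per_metric = {
    "cpu": [None, "low", "high", None],
    "mem": [None, None, None, "medium"],
}
print(node_level_anomalies(per_metric))                    # [False, True, True, True]
print(node_level_anomalies(per_metric, min_level="high"))  # [False, False, True, False]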
For the scope of this note we are just going to focus on machine-1-1 of the SMD.
Results on the original labeling
...
We enable our machine learning at the end of the training part of the dataset.
...
Recall: 0.997
Precision: 0.34
F1: 0.51
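As a quick sanity check, plugging these numbers into the formula from the introduction gives F1 = 2*0.997*0.34 / (0.997 + 0.34) ≈ 0.51.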
...
Relabeling
To illustrate how we have relabeled the data, we are going to focus on a single variable of machine-1-1.
...
These are the test data, for which anomalies are labeled:
...
Let’s zoom in and circle the labeled anomalies in red:
...
In the following image you see the anomalies that we relabeled (in yellow).
The first relabeled anomaly was added because, for this metric, the values in the second half of the day usually do not exceed those in the first half of the day by this much; the other three were added because the peaks have a higher value than previously observed.
...
The last two anomalies were identified because, even though they had some similarities with the anomalies seen previously, they showed a much more oscillatory behavior.
Similar relabeling was done for all the variables of machine-1-1.
Results for the relabeled data
...
Anomalies of all criticalities:
...
Recall: 0.999
Precision: 0.76
F1: 0.86
Conclusion
Eyer's algorithm consistently shows high recall, meaning it successfully captures all labeled anomalies in both the original and relabeled datasets. While precision and F1 scores were lower with the original labels, they improved significantly after relabeling. This is of course a partial test, and a wider range of labeled data is necessary in order to further assess the precision of Eyer's algorithm. The strong performance on this data confirms the algorithm's effectiveness in handling datasets with a high degree of cyclicality, as seen in the considered metrics, which displayed clear daily cyclical patterns. The comparison presented here is based on the criticality of the anomalies, but in Eyer we also assign a criticality to the alert, based on how many critical anomalies it contains. An alert can span multiple nodes, while the analysis presented in this note covers a single node. We are going to assess the F1 score based on alert criticality in future multi-node experiments.
Bibliography
1 - Server Machine Dataset: https://github.com/NetManAIOps/OmniAnomaly/tree/7fb0e0acf89ea49908896bcc9f9e80fcfff6baf4/ServerMachineDataset
...