...

Before presenting the results we have to introduce how we define true positives, false positives, and false negatives. It might seem trivial to define these, but in reality an anomaly is very often not a single data point but a series of data points. In IT operations, anomalies are often deviations from the usual behaviour that persist for some time. When labelling our data manually we might miss the actual starting and final points of these deviations, and sometimes it is impossible to define a starting and finishing point precisely. For example: does an anomaly start when a deviation is well established, or should it also include the oscillations that preceded it? Does an anomaly finish when the value of a metric is back to a normal value, or when it is on its way back to the normal value but has not reached it yet?

To capture this variability in labelling we have established the following rules:

  • A labelled anomaly (a series of data points which were labelled as anomalous) that overlaps with a detected anomaly (a series of data points that our algorithm classifies as anomalous) is considered discovered: none of the data points in the labelled anomaly are counted as false negatives, even if they do not correspond to data points classified as anomalous by our algorithm.

  • All the data points of a detected anomaly are considered true positives if the detected anomaly does not last more than 50% longer than its overlap with the labelled anomalies.

  • If a detected anomaly persists for more than 50% longer than its overlap with the labelled anomalies, all the points “in excess” are labelled as false positives.

  • All the points of detected anomalies which do not overlap with any labelled anomaly are considered false positives.

  • All the points of a labelled anomaly which does not overlap with any detected anomaly, and is therefore not detected, are considered false negatives.
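The rules above can be sketched in code. This is a minimal illustration, not Eyer's actual implementation: we assume anomalies are represented as half-open index intervals `(start, end)`, and the function and variable names are ours.

```python
def evaluate(labelled, detected):
    """Count TP/FP/FN points under the interval-overlap rules.

    labelled, detected: lists of (start, end) half-open index intervals.
    Illustrative sketch only; interval representation is an assumption.
    """
    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))

    tp = fp = fn = 0
    for d in detected:
        length = d[1] - d[0]
        ov = sum(overlap(d, l) for l in labelled)
        if ov == 0:
            fp += length                    # no overlap: all points are false positives
        elif length <= 1.5 * ov:
            tp += length                    # within 50% of the overlap: all true positives
        else:
            tolerated = int(1.5 * ov)
            tp += tolerated                 # tolerated part still counts as true positives
            fp += length - tolerated        # points "in excess" are false positives
    for l in labelled:
        if all(overlap(l, d) == 0 for d in detected):
            fn += l[1] - l[0]               # fully undetected labelled anomaly: false negatives
    return tp, fp, fn
```

For example, a labelled anomaly spanning points 10–20 that overlaps a detection spanning 12–18 yields no false negatives, while a detection with no labelled counterpart contributes only false positives.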

One thing to note is that, for very active metrics, our algorithm is designed to ignore deviations that last less than 8 minutes [3], to reduce alert fatigue, so very short anomalies will not be detected.

The SMD contains several metrics for each machine, and each machine is translated to a node in Eyer. Our alerting mechanism packs the anomalies on all the metrics of a node into a single alert, therefore we test our algorithm in a multivariate way: an anomaly on a single metric is considered an anomaly for the full node.

Our algorithm is a multilevel algorithm: it classifies each detected anomaly as low, medium, or high criticality. We therefore present the results both when anomalies of all criticalities are considered and when we focus only on the anomalies with higher criticality. When going from low to high criticality, the recall usually goes down and the precision increases.
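Precision, recall, and the F1 score [2] follow directly from the point counts defined above. A minimal sketch (the function name is ours):

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from TP/FP/FN point counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0   # fraction of detected points that are real
    recall = tp / (tp + fn) if tp + fn else 0.0      # fraction of labelled points that were found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean of the two
    return precision, recall, f1
```

Restricting the evaluation to high-criticality anomalies typically lowers `tp` and `fp` together, which is why recall drops while precision rises.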

...

2 - https://en.wikipedia.org/wiki/F-score#:~:text=The%20F1%20score%20is%20the%20Dice%20coefficient%20of%20the,of%20the%20positive%20class%20increases.

3 - Eyer documentation: High frequency high activity anomaly detection