This part of the pipeline is dedicated to detecting anomalies on single time series. It consists of a training part (baseline creation), which periodically (every 24 hours) determines the normal behaviour of the time series, and a detection part (live anomaly detection), which determines, close to real time, whether a behaviour different from the one observed in the past (an anomaly) is ongoing.
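As a rough illustration of this split, the sketch below separates the periodic baseline training from the near-real-time detection step. The names and the simple mean/standard-deviation baseline are invented for illustration; this is not the pipeline's actual code.

```python
# Illustrative skeleton of the training / detection split, assuming an invented
# Baseline object and the 24-hour retraining period described above.
from dataclasses import dataclass
import statistics

RETRAIN_INTERVAL_HOURS = 24  # baseline creation runs periodically, every 24 hours


@dataclass
class Baseline:
    mean: float
    std: float


def train_baseline(history: list[float]) -> Baseline:
    """Summarise the 'normal' behaviour of one time series from historical data."""
    return Baseline(mean=statistics.mean(history), std=statistics.pstdev(history))


def detect(point: float, baseline: Baseline, n_std: float = 3.0) -> bool:
    """Live check: is the incoming point outside the behaviour seen in the past?"""
    return abs(point - baseline.mean) > n_std * baseline.std
```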

...

These are the metrics that are richest in information, so we can make a more complete analysis. Before forming the corridor, the data are re-aggregated according to their frequency, to handle missing data. When the data are aggregated as a total, the analysis considers the data smoothed by re-aggregating over 5 minutes; this reabsorbs oscillations and distinguishes the case in which a metric is consistently zero for a long time from the one in which there are occasional oscillations to zero. One baseline describing the most frequent behaviour is always formed; secondary baselines are created if there are enough data that do not fit the main behaviour. This is done by fitting the historical data per hour and considering 3 standard deviations.

An autoregressive model is also learned to predict the next data point from the last data point that came in. This autoregressive model is used in the anomaly detection phase to confirm that a trend of data points diverging from the main behaviour should be considered anomalous. For these metrics we use a confirmation window of 15 minutes, and an anomaly is detected only if there are more than 8 minutes in which the value of the metric deviates from the main corridor. These metrics are prone to frequent oscillation, so using a confirmation window reduces alert fatigue. This also means that anomalies shorter than 8 minutes are not detectable.
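A minimal sketch of the corridor construction, the AR(1)-style predictor, and the confirmation-window rule, assuming pandas, a datetime-indexed series, and invented function names; the real implementation may differ.

```python
# Sketch only: 5-minute re-aggregation, per-hour mean +/- 3 std corridor,
# a simple AR(1) coefficient, and the 15-minute / >8-minute confirmation rule.
import pandas as pd


def build_corridor(series: pd.Series) -> pd.DataFrame:
    """Re-aggregate to 5-minute buckets, then fit a mean +/- 3 std corridor per hour of day."""
    smoothed = series.resample("5min").mean()  # reabsorb short oscillations and missing data
    stats = smoothed.groupby(smoothed.index.hour).agg(["mean", "std"])
    stats["lower"] = stats["mean"] - 3 * stats["std"]
    stats["upper"] = stats["mean"] + 3 * stats["std"]
    return stats[["lower", "upper"]]


def fit_ar1(values: pd.Series) -> float:
    """Least-squares AR(1) coefficient: predicts the next point from the last one."""
    x, y = values.iloc[:-1], values.iloc[1:]
    return float((x.to_numpy() * y.to_numpy()).sum() / (x.to_numpy() ** 2).sum())


def confirm_anomaly(outside_flags: list[bool]) -> bool:
    """Confirmation window: one flag per minute, anomaly only if more than 8 of the last 15 minutes are outside the corridor."""
    window = outside_flags[-15:]
    return sum(window) > 8
```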

...

High Frequency Low Activity

Low Frequency

...

For each aggregated data point, a different weight is assigned depending on whether the point is inside the main baseline (in the case of HFHA it can also be outside, but diverging at a reasonable trend and not too far, as judged by the autoregressive model), inside a secondary baseline, or outside all the baselines. An anomaly score is built incrementally by averaging these weights over time for as long as the data points are mostly outside the main baseline. This results in a measurement (the score) that, even if not rigorous, gives an “at a glance” description of the anomaly and can be averaged across nodes and systems. The score is then summarised into levels of criticality, classifying the data points as severe (high probability of anomaly), medium (medium probability of anomaly), or low (low probability of anomaly). A yellow anomaly has a lower likelihood of being a real anomaly than a red one, and can also correspond to an anomaly that is escalating or resolving. If we focus on the anomalies classified as red, they include the most severe deviations from the behaviour seen in the past, even though they will not include all the anomalies.
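A hedged sketch of how such a score could be accumulated and summarised. The weights, thresholds, and level names below are assumptions for illustration only, not the values used by the pipeline.

```python
# Illustrative score accumulation; weights and thresholds are assumed.
WEIGHTS = {
    "main": 0.0,       # point falls inside the main baseline
    "secondary": 0.5,  # point falls inside a secondary baseline
    "outside": 1.0,    # point falls outside all baselines
}


def update_score(accumulated: list[float], membership: str) -> float:
    """Incrementally average the per-point weights while points stay mostly outside the main baseline."""
    accumulated.append(WEIGHTS[membership])
    return sum(accumulated) / len(accumulated)


def criticality(score: float) -> str:
    """Summarise the score into levels of criticality; thresholds here are assumptions."""
    if score >= 0.8:
        return "severe"  # high probability of anomaly (the most severe deviations)
    if score >= 0.5:
        return "medium"
    return "low"
```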

...