Alerting

Anomalies on individual time series are grouped into a single alert, which may span several nodes. This gives more context for each anomaly and reduces the number of alerts sent to the user. Time series on the same node are already considered related, so they are always alerted together. To capture relations between nodes, we use groups built from correlations among nodes. Time series that belong to nodes in the same group are alerted together.
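The grouping rule can be illustrated with a minimal sketch. The function name, the (node, metric) tuples and the representation of correlation groups as sets of node names are all assumptions made for illustration, not the product's actual data model. The sketch also assumes each node belongs to at most one group.

```python
from collections import defaultdict


def group_anomalies(anomalies, node_groups):
    """Bundle anomalies into alerts (hypothetical sketch).

    anomalies:   list of (node, metric) tuples, one per anomalous time series.
    node_groups: list of sets of node names; each set is one correlation group.
    Returns a list of alerts, each alert being a list of (node, metric) tuples.
    """
    # Map every grouped node to the key of its correlation group.
    node_to_key = {}
    for index, group in enumerate(node_groups):
        for node in group:
            node_to_key[node] = ("group", index)

    alerts = defaultdict(list)
    for node, metric in anomalies:
        # Nodes outside any group still bundle their own metrics together.
        key = node_to_key.get(node, ("node", node))
        alerts[key].append((node, metric))
    return list(alerts.values())


anomalies = [("web-1", "cpu"), ("web-1", "mem"), ("db-1", "io"), ("cache-1", "hits")]
alerts = group_anomalies(anomalies, [{"web-1", "db-1"}])
```

Here the anomalies on web-1 and db-1 end up in one alert because the two nodes share a group, while cache-1 gets an alert of its own.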

Alerting is enabled once the first metrics of your environment are onboarded to the ML pipeline (Onboarding, preprocessing and filtering of the data). Metrics are onboarded after a minimum of 7 days, which allows enough data to be collected to learn baselines and correlations. As more data is collected, baselines and correlations improve, and alerting becomes less noisy after the first few weeks.

When you receive an alert (Alerts - structure and data explained), it contains a field for the severity of the alert itself and a field for the severity of each deviation included in the alert. Both the severity of the alert and the severity of the deviations can be used to set up notifications and automated actions.

Severity of the deviations on single metrics

The severity of a deviation on a single metric indicates how likely it is that the deviation is an anomaly. The severity is determined using the multiple baselines.

Low: Low probability of being an anomaly. The metric has spent most of the time in the main or in the secondary corridor.

Medium: Medium probability of being an anomaly. The metric has spent some time outside all the baselines, but not the majority.

Severe: High probability of being an anomaly. The metric has spent most of the time outside all the baselines.

Alerts are created for deviations of any severity and are updated every time a metric changes severity. Customised actions can be set for when an alert contains at least one deviation of a certain severity, or for when a certain metric hits a certain severity.
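The two trigger conditions for customised actions can be sketched as simple predicates. The dictionary layout and the field names "metric" and "severity" are hypothetical and chosen only for illustration; the actual alert structure is described in the alert documentation. The sketch assumes an exact match on the severity level.

```python
def contains_severity(deviations, severity):
    """True if at least one deviation in the alert has the given severity."""
    return any(d["severity"] == severity for d in deviations)


def metric_hits_severity(deviations, metric, severity):
    """True if the named metric is present with the given severity."""
    return any(
        d["metric"] == metric and d["severity"] == severity for d in deviations
    )


# Hypothetical alert content: one entry per deviation.
deviations = [
    {"metric": "cpu", "severity": "low"},
    {"metric": "latency", "severity": "severe"},
]

run_action = contains_severity(deviations, "severe")
page_on_latency = metric_hits_severity(deviations, "latency", "severe")
```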

Severity of the alert

The severity of the alert is based not only on the severity of the metrics included in the alert, but also on how the deviations propagate across correlated metrics and nodes.

Low: The alert does not contain any severe deviation.

Medium: The alert contains at least one severe deviation, but only one node is impacted, and less than 75% of the metrics of that node have a severe deviation.

Severe: The alert contains several metrics that are in a severe state. If only one node is involved, more than 75% of its metrics are in a severe state. Note that a node with only one metric automatically triggers a severe alert if it is affected by a severe deviation. An alert that includes more than one node with severe deviations is always in a severe state.

This severity state can also be used to define customised actions.
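The alert severity rules above can be expressed as a short sketch. The function name and its inputs (per-node counts of severe metrics and of total metrics) are assumptions for illustration only. The text does not say what happens at exactly 75%; this sketch assumes that case counts as severe.

```python
def alert_severity(severe_by_node, total_by_node, threshold=0.75):
    """Classify an alert as 'low', 'medium' or 'severe' (hypothetical sketch).

    severe_by_node: dict mapping node name to its number of metrics with a severe deviation.
    total_by_node:  dict mapping node name to its total number of metrics.
    """
    impacted = [node for node, count in severe_by_node.items() if count > 0]

    # No severe deviation anywhere in the alert.
    if not impacted:
        return "low"

    # More than one node with severe deviations is always severe.
    if len(impacted) > 1:
        return "severe"

    # Exactly one impacted node: compare its share of severe metrics to 75%.
    node = impacted[0]
    ratio = severe_by_node[node] / total_by_node[node]
    # Assumption: a ratio of exactly 75% is treated as severe.
    return "medium" if ratio < threshold else "severe"


single_metric_node = alert_severity({"db-1": 1}, {"db-1": 1})
partial_node = alert_severity({"web-1": 1}, {"web-1": 4})
```

A node with a single metric has 100% of its metrics in a severe state as soon as that metric deviates severely, which is why it falls into the severe case.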