Abstract

We have been evaluating Eyer's core anomaly detection algorithm by calculating F1 score, recall, and precision using a subset of the Server Machine Dataset, a publicly available labelled open source dataset. Initially, we tested the dataset as is, then proceeded to relabel it. The relabelling was prompted by the observation that some events, which appeared anomalous based on the data, were not labelled as such. We'll discuss this observation further in the section dedicated to relabelling. This test was conducted on a smaller dataset, and we plan to release additional updates with larger datasets in the future. Our algorithm consistently shows high recall in both cases, the original one and the relabelled one, indicating its effectiveness in identifying all labelled anomalies. For the original labelling of the dataset, precision and, consequently, the F1 score were low, but improved significantly after relabelling. In the relabelled dataset, we achieved an F1 score of 0.81 (0.86 when considering only the critical anomalies) after the first four weeks of data collection.

Introduction

We are continuously testing our algorithms to enhance their quality. This note focuses on a specific test using a labelled open-source dataset, the Server Machine Dataset (SMD) 1, which is sometimes used in the literature to evaluate AIOps algorithms for monitoring IT systems. The performance metrics that we focus on are the F1 score 2, recall, and precision.

The F1 score reflects the balance between how well the algorithm identifies all actual anomalies (recall) and how many alerts are genuine (precision). It is calculated as:

F1 = 2*R*P / (R+P)

Where R and P are recall and precision, which are defined:

R = truePositives / (truePositives + falseNegatives)

P = truePositives / (truePositives + falsePositives)
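
As a concrete illustration, here is a minimal Python sketch that computes precision, recall, and F1 from raw counts. The function name and the example numbers are hypothetical and chosen purely for illustration; they are not taken from Eyer's implementation.

    def precision_recall_f1(true_positives, false_positives, false_negatives):
        # Precision: fraction of alerts that correspond to real anomalies
        precision = true_positives / (true_positives + false_positives)
        # Recall: fraction of real anomalies that were alerted on
        recall = true_positives / (true_positives + false_negatives)
        # F1: harmonic mean of precision and recall
        f1 = 2 * recall * precision / (recall + precision)
        return precision, recall, f1

    # Hypothetical counts: 40 correct alerts, 5 false alerts, 10 missed anomalies
    p, r, f1 = precision_recall_f1(40, 5, 10)
    print(f"precision={p:.2f}, recall={r:.2f}, F1={f1:.2f}")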

The dataset is said to be anomaly-free in the first half (which is used for training) and labelled for anomalies in the second half. Labelling is always a difficult task and is usually done with a specific use case in mind; it is therefore often context-specific and can appear incomplete when taken out of its original context. Additionally, the claim that the first half of the dataset is anomaly-free can be questioned. Eyer is designed to be as generic as possible, not hyper-tuned to a specific use case, meaning that our evaluation requires labelling all changes in the dataset compared to prior behaviour. However, the labelling in the SMD dataset does not seem to follow this approach. Furthermore, as labelling is a time-consuming task, open datasets are often only partially labelled, and human errors can lead to missed anomalies.

Our approach involves first testing the algorithm using the original labels, then relabelling the dataset and measuring performance again. A section will be dedicated to explaining how the relabelling was conducted. In the future, we will expand this exercise to include more data from the SMD and other datasets, both open-source and internally generated.

Methodology

Before presenting the results we have to introduce how we define true positives, false positives, and false negatives. It might seem trivial to define these, but in reality an anomaly is very often not a single data point but a series of data points. In the case of IT operations, anomalies are often deviations from the usual behaviour that persist for some time. When manually labelling our data we might miss the actual starting and final points of said deviations, and sometimes it might be impossible to define a starting and finishing point precisely. For example, does an anomaly start when a deviation is well established, or should we also include the oscillations that preceded it? Does an anomaly finish when the value of a metric is back to a normal value, or when it is on its way back to the normal value but has not reached it yet?
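
To make this concrete, the sketch below shows one common event-level counting convention, assuming that a detected anomaly window counts as a true positive if it overlaps any labelled window, that a detection overlapping no labelled window is a false positive, and that a labelled window with no overlapping detection is a false negative. This is only an illustrative sketch of that convention, not the exact matching logic of our evaluation, and all names and example windows are hypothetical.

    from typing import List, Tuple

    Window = Tuple[int, int]  # (start, end) of an anomalous range, in sample indices

    def overlaps(a: Window, b: Window) -> bool:
        # Two closed intervals overlap if each starts before the other ends
        return a[0] <= b[1] and b[0] <= a[1]

    def count_events(detected: List[Window], labelled: List[Window]):
        # True positive: a detection overlapping at least one labelled anomaly
        tp = sum(1 for d in detected if any(overlaps(d, l) for l in labelled))
        # False positive: a detection overlapping no labelled anomaly
        fp = len(detected) - tp
        # False negative: a labelled anomaly with no overlapping detection
        fn = sum(1 for l in labelled if not any(overlaps(d, l) for d in detected))
        return tp, fp, fn

    # Hypothetical example: two detections, two labelled anomalies
    tp, fp, fn = count_events(detected=[(10, 20), (50, 55)], labelled=[(12, 25), (80, 90)])
    print(tp, fp, fn)  # prints: 1 1 1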

...