4P Strategy for AIOps: Reducing alert noise
In my previous blog I talked about the 4Ps of preventive healing (Predict, Probe, Prevent and Plan) in enterprise AIOps. To recap, preventive healing is a paradigm shift from the break-fix model of APM/AIOps: it focuses on preventing an outage by predicting it, rather than fixing it after the fact. To understand the importance of preventive healing, just recall the recent Facebook outage. The outage, caused by a bad configuration push, lasted about 6 hours and resulted in an estimated loss of $6 billion for Facebook. This incident underscores the need to improve monitoring of cloud-based services by moving to a preventive model.
But before we can think of prevention, one of the first things that needs attention is how we generate alerts. Most modern APM and AIOps systems, in trying to catch every single problem, generate far too many alerts, most of which are false positives. These floods of alerts, or alert storms, divert attention away from the real problems. In this blog I will talk about a few techniques that reduce alert volume by removing the noise, so that SREs can focus their attention on the critical problems.
Learning from the past:
Most traditional outlier detection systems use unsupervised models based on statistical measures. These models may be very good at time series analysis (trend detection, seasonality, etc.) but lack one of the most important capabilities: the power to learn from history. Therein lies the real power of AI, so we should not ignore this possibility.
However, learning in the case of outliers is tricky, as it is difficult to identify with high confidence which outliers are real and which are not. Nevertheless, some amount of learning is possible with a semi-supervised approach by harvesting implicit and explicit feedback data. Implicit feedback may be obtained from incident reports (by hooking into ITSM tools like ServiceNow), which indicate issues the outlier system missed earlier. We can learn from the patterns of these incidents and feed them back to our ML model so that it can identify similar problems in the future.
Implicit information on false positives, however, is not available: there is no sure way to verify whether the problem was averted because of some action taken by the SREs or whether it was indeed a bogus event. The only way is to use explicit feedback from the SREs.
It is important to note that the ground truth labels obtained by implicit and explicit means will only account for a small fraction of the training samples (keeping in mind the huge volume of alert data). As such, feedback-based learning will take time to catch up with reality, but given sufficient time it will greatly improve the precision and recall of the system.
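As a concrete illustration, here is a minimal sketch (in Python, with scikit-learn) of how such feedback could be folded back into the model: alerts that turned into ITSM incidents become positive labels, alerts dismissed by SREs become negative ones, and a simple classifier re-scores future alerts. The feature names and feedback records are purely illustrative assumptions, not a prescribed schema.

```python
# Minimal sketch of folding SRE feedback into an alert classifier.
# Feature names and feedback records below are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each alert is summarised as a small feature vector, e.g.
# [anomaly_score, duration_minutes, num_affected_services]
alert_features = np.array([
    [0.92, 30, 4],   # confirmed incident (implicit label from an ITSM ticket)
    [0.85, 45, 6],   # confirmed incident
    [0.67,  5, 1],   # SRE marked as false positive (explicit feedback)
    [0.71,  3, 1],   # SRE marked as false positive
])
labels = np.array([1, 1, 0, 0])  # 1 = real problem, 0 = noise

# Train on the (small) labelled fraction; unlabelled alerts keep their
# original unsupervised anomaly score until enough feedback accumulates.
clf = LogisticRegression().fit(alert_features, labels)

new_alert = np.array([[0.88, 20, 3]])
print("P(real problem) =", clf.predict_proba(new_alert)[0, 1])
```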
Event Correlation:
To bring more order to events we need to correlate them based on contextual information such as time, topology and sequential patterns, and also across the different artifacts in the software stack such as metrics, logs, incidents and topology. This is necessary because different events may be triggered by the same underlying root cause. Using correlation we can group these duplicate events together and reduce the number of actionable alerts in the system. The final output is a group of correlated events pointing to the same root cause, which can collectively be called an event group or alert.
The first level of correlation can be done based on time and instance (e.g. same server/pod/cluster); this is also called spatio-temporal correlation. For example, when multiple events happen simultaneously on the same instance, there is a high likelihood that they stem from the same root cause and can be correlated.
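A minimal sketch of this idea, assuming events arrive as simple dictionaries with an instance name and a timestamp: events that fall on the same instance within the same (hypothetical) 120-second bucket are collapsed into one group. A production system would use a sliding window rather than fixed buckets, but the principle is the same.

```python
# Minimal sketch of spatio-temporal correlation: events on the same
# instance within the same time bucket are grouped into one alert.
# The event records and the 120-second window are illustrative assumptions.
from collections import defaultdict

WINDOW_SECONDS = 120

events = [
    {"id": "e1", "instance": "pod-7", "ts": 1000, "type": "cpu_spike"},
    {"id": "e2", "instance": "pod-7", "ts": 1045, "type": "high_latency"},
    {"id": "e3", "instance": "pod-9", "ts": 1050, "type": "disk_full"},
    {"id": "e4", "instance": "pod-7", "ts": 1500, "type": "cpu_spike"},
]

groups = defaultdict(list)  # (instance, time bucket) -> event ids
for ev in events:
    bucket = ev["ts"] // WINDOW_SECONDS   # coarse fixed-size time bucket
    groups[(ev["instance"], bucket)].append(ev["id"])

for (instance, bucket), members in groups.items():
    print(instance, bucket, members)
```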
It is also possible that some events co-occur with a certain lag. For example, longer lock waits on a DB (event-1) may usually be followed, after a small delay, by an increase in average response time (event-2). Or, if a downstream service goes down, the upstream services connecting to it may also start alerting after some time. These relationships can be mined from historical event data using causal inferencing techniques. Essentially, we need to correlate the time series of a pair of events to determine whether a dependency exists and, if so, what its direction is. To ensure spurious relationships are not mined, we may use knowledge of how the services are interconnected (the service topology) of the application. However, topology knowledge is not strictly necessary: causal graphs can be mined using causal inferencing techniques alone.
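The sketch below shows the simplest version of this mining step: scan candidate lags between two event-count series and pick the lag with the highest correlation, which also suggests the direction of the dependency (the series that leads is the likely cause). The synthetic data, the maximum lag and the use of plain lagged correlation (rather than a full causal inference method) are all assumptions for illustration.

```python
# Minimal sketch of mining a lagged dependency between two event-count
# time series (counts per minute). Synthetic data for illustration only.
import numpy as np

rng = np.random.default_rng(0)
lock_waits = rng.poisson(2, 200).astype(float)
# Response-time alerts follow lock waits with a lag of ~3 samples
# (np.roll wraps at the ends, which is acceptable for a sketch).
resp_time_alerts = np.roll(lock_waits, 3) + rng.normal(0, 0.3, 200)

def best_lag(cause, effect, max_lag=10):
    """Return the lag (in samples) that maximises correlation cause -> effect."""
    best, best_corr = 0, -1.0
    for lag in range(1, max_lag + 1):
        corr = np.corrcoef(cause[:-lag], effect[lag:])[0, 1]
        if corr > best_corr:
            best, best_corr = lag, corr
    return best, best_corr

lag, corr = best_lag(lock_waits, resp_time_alerts)
print(f"lock waits lead response-time alerts by ~{lag} samples (corr={corr:.2f})")
```

Running the same check in the opposite direction and comparing the correlations gives a crude indication of which event is the likely driver.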
A third level of correlation is possible by mining the dependence between raw metrics and event sequences. These patterns or sequences can manifest across both logs and metrics, and require correlation across event and time series data. Additionally we can take cues from historical incidents to discover such patterns. At runtime we simply look for the mined patterns within a specified time window.
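For instance, assuming a pattern has already been mined as an ordered list of event types, the runtime check could look something like this sketch (the pattern, events and window size are illustrative assumptions):

```python
# Minimal sketch of matching a previously mined event pattern inside a
# time window over the runtime event stream.
mined_pattern = ["disk_io_anomaly", "log_error_burst", "latency_breach"]
WINDOW_SECONDS = 300

events = [  # (timestamp, event_type), sorted by time
    (100, "disk_io_anomaly"),
    (160, "log_error_burst"),
    (210, "latency_breach"),
    (900, "cpu_spike"),
]

def pattern_in_window(events, pattern, window):
    """Return True if the pattern occurs in order within one time window."""
    for i, (start_ts, etype) in enumerate(events):
        if etype != pattern[0]:
            continue
        idx = 1
        for ts, later_type in events[i + 1:]:
            if ts - start_ts > window:
                break
            if later_type == pattern[idx]:
                idx += 1
                if idx == len(pattern):
                    return True
    return False

print(pattern_in_window(events, mined_pattern, WINDOW_SECONDS))
```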
Sometimes the events raised by monitoring agents include a short description, which can contain excerpts from logs, error codes and other metadata. Such a description may point to the underlying problem that triggered the event. When it is available, we can additionally correlate events by matching their descriptions (both syntactically and semantically) using NLP techniques.
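As a rough sketch of the syntactic side of this matching, event descriptions could be vectorized with TF-IDF and compared with cosine similarity; the sample descriptions and the 0.2 threshold below are assumptions, and an embedding model could be substituted for the semantic case.

```python
# Minimal sketch of correlating events by the textual similarity of their
# descriptions. Descriptions and threshold are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [
    "ORA-00060 deadlock detected while waiting for resource",
    "Deadlock detected on orders DB, transaction rolled back",
    "Disk usage above 90% on /var/log partition",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(descriptions)
sim = cosine_similarity(tfidf)

THRESHOLD = 0.2  # assumed cut-off for "related"
for i in range(len(descriptions)):
    for j in range(i + 1, len(descriptions)):
        if sim[i, j] >= THRESHOLD:
            print(f"events {i} and {j} look related (similarity={sim[i, j]:.2f})")
```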
Last but not least, the above correlations need to happen in real time, so that as soon as new events are generated they are automatically tagged or grouped with a previous event (if applicable).
Alert Ranking:
Once the events have been correlated into alerts (or event groups), we can remove another layer of noise by ranking the groups in order of priority. To rank the alerts we need to look at the constituent events within each alert.
Individual events can be ranked by evaluating the interestingness of each event with respect to its context. For example, if an event is periodic and happens frequently in any given interval, its presence in an interval is not as interesting as that of an event which happens rarely. Additionally, the frequency or scale at which the event occurs in the time interval needs to be taken into consideration: if the frequency of a periodic event is much higher than its average frequency for a given interval, it can still be considered interesting.
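One possible (assumed, not prescriptive) way to score this interestingness is to compare the event's count in the current interval against its historical average for similar intervals:

```python
# Minimal sketch of an interestingness score: rare or unusually frequent
# events score high, routine periodic events score low.
# The formula and sample counts are illustrative assumptions.
import math

def interestingness(current_count, avg_count):
    """Score in [0, 1]: 0 at or below the historical norm, near 1 far above it."""
    if avg_count == 0:
        return 1.0                                   # never seen before
    ratio = current_count / avg_count
    return 1.0 - math.exp(-max(ratio - 1.0, 0.0))    # 0 when at/below the norm

print(interestingness(current_count=2, avg_count=50))    # routine periodic event -> 0
print(interestingness(current_count=200, avg_count=50))  # 4x its usual rate -> high
print(interestingness(current_count=3, avg_count=0))     # brand new event -> 1.0
```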
Once individual events are ranked, we need to rank the event groups in order of relevance. For this we can use a weighted sum of the interestingness scores of the constituent events, as explained above. Another factor that contributes to relevance is the impact factor: how likely the sequence of events in the alert is to cause an outage. By combining both scores using ML, we can arrive at a final relevance score (and severity level) for each alert. This should help SREs and ITOps focus on the most important problems and ignore the low-priority ones.
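A toy sketch of this ranking step, using an assumed fixed weighting in place of a learned ML combination, might look like this:

```python
# Minimal sketch of ranking alert groups by relevance. The weights,
# scores and alert names are illustrative assumptions; in practice the
# combination would be learned from outage history.
alerts = {
    "alert-db-locks":  {"event_scores": [0.9, 0.7, 0.4], "impact": 0.8},
    "alert-batch-job": {"event_scores": [0.2, 0.1],      "impact": 0.1},
}

W_INTEREST, W_IMPACT = 0.6, 0.4  # assumed weights

def relevance(alert):
    interest = sum(alert["event_scores"]) / len(alert["event_scores"])
    return W_INTEREST * interest + W_IMPACT * alert["impact"]

for name, alert in sorted(alerts.items(), key=lambda kv: relevance(kv[1]), reverse=True):
    print(f"{name}: relevance={relevance(alert):.2f}")
```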
That’s all, folks! In my next blog I’ll dive deeper into probe (root cause analysis) techniques like fault localization, and also touch upon some of the prevent techniques like incident similarity and enrichment.