The 4P Strategy for Preventive Healing using AI/ML in Enterprise ITOps
With the rapid proliferation of cloud-based services across the globe, these services need round-the-clock availability and monitoring, which makes AI and ML a necessity in enterprise ITOps. Even a brief outage in an online service cannot be tolerated, as it may result not only in an immediate loss of revenue for the business but also in a long-term negative impact on brand value. Just imagine what will happen if a popular e-commerce website crashes just before the Black Friday sale!
A report by Gartner estimates that the average cost of an outage can be significant, sometimes as much as $540,000 per hour. It is therefore evident that we can no longer rely on the traditional break-fix model, in which we detect an outage or transaction degradation and then react to it with post-facto root cause analysis (RCA). Instead, we should focus on preventing the outage from happening and on taking proactive action wherever possible to minimize damage. This can be a paradigm shift for enterprise ITOps and can save millions of dollars by preventing outages and reducing downtime.
To give an analogy from crime investigation, the old break-fix model is similar to traditional cops and detectives (like Sherlock Holmes) who start investigating after a crime is committed. The predictive model, on the other hand, is akin to the Minority Report model (see picture above), where precogs predict crimes and the would-be criminals are arrested even before the crime is committed.
Although there have been a few attempts in recent times to solve the predictive problem, they have mostly been unreliable. Most current AIOps tools suffer from low precision, which results in alert storms. Valuable man-hours are wasted on false alerts, and the predictive system becomes ineffective for most practical purposes. Also, there is currently no attempt to classify and predict the problem type, so predictive RCA is either not possible or completely unreliable. Likewise, automated healing actions are virtually non-existent, and most diagnosis and remediation is done manually. Some automated remediation exists in the form of robotic process automation (RPA), but it is predominantly rule-based and cannot be relied upon for most practical purposes.
This is where AI can be a game changer. With the right AI we can move towards a reliable preventive model that will not only predict brewing problems but also suggest remediation steps. In this blog I will outline a recipe for such a preventive healing model based on four pillars, viz. Predict, Probe, Prevent and Plan. The approach is described in a nutshell below.
Predict (Is there a problem?)
The first step towards preventive healing is prediction of an anomaly. Since all the downstream activities depend on this step, the prediction algorithm needs to be robust.
One limitation of traditional outlier detection systems is that they use only unsupervised learning methods and therefore cannot learn from past mistakes. Supervised learning cannot be applied directly, as there is no reliable ground truth: we cannot accurately verify whether an outlier was raised correctly or whether a real one was missed. To improve the outcome of anomaly detection we can adopt a semi-supervised approach. Essentially, we can mine information from past incidents to learn the sequence of events associated with a problem and learn to predict it correctly in the future. Additionally, we can get explicit feedback from end users about the quality of predicted outliers and use it to suppress false alerts.
In subsequent blogs I will elaborate more on how we can do this effectively to control alert storms and also reduce missed alerts (false negatives).
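As a rough illustration of how explicit user feedback could suppress false alerts, here is a minimal Python sketch. It is not the method used in any particular product: the robust z-score detector, the `FeedbackAwareDetector` class, and the per-metric threshold adjustment are all assumptions chosen to keep the example small. Mining past incidents, as described above, would need a much richer model.

```python
from statistics import median


def robust_z_scores(values):
    """Score each point by its distance from the median, scaled by the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1e-9  # avoid division by zero
    return [0.6745 * (v - med) / mad for v in values]


class FeedbackAwareDetector:
    """Hypothetical detector: unsupervised scoring plus feedback-tuned thresholds."""

    def __init__(self, threshold=3.5, step=0.5):
        self.default_threshold = threshold
        self.step = step
        self.thresholds = {}  # per-metric threshold learned from feedback

    def detect(self, metric, values):
        """Return the indices of points whose score exceeds the metric's threshold."""
        limit = self.thresholds.get(metric, self.default_threshold)
        scores = robust_z_scores(values)
        return [i for i, s in enumerate(scores) if abs(s) > limit]

    def feedback(self, metric, was_real_problem):
        """Confirmed alerts make the metric more sensitive; false alerts less so."""
        limit = self.thresholds.get(metric, self.default_threshold)
        limit += -self.step if was_real_problem else self.step
        self.thresholds[metric] = max(1.0, limit)


detector = FeedbackAwareDetector()
cpu = [10, 11, 10, 12, 11, 10, 17]        # illustrative samples, not real data
first = detector.detect("cpu", cpu)       # the last point is flagged
detector.feedback("cpu", was_real_problem=False)
detector.feedback("cpu", was_real_problem=False)
second = detector.detect("cpu", cpu)      # the same point is now suppressed
```

The design choice here is deliberately conservative: feedback only moves a threshold, so a burst of wrong feedback can be undone by later confirmations.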
Probe (What is the problem?)
In the preventive approach, the probe step, i.e. root cause analysis (as well as the remediation steps), which is typically done by SREs, needs to happen automatically without manual intervention. The first step towards this is to predict the future problem with a high degree of confidence by looking back at events in the immediate past. This is significantly different from conventional post-facto analysis and requires correlating orthogonal sources of information such as alerts, incidents, traces, logs and topology. The final outcome of the analysis should be a group of correlated alerts for the predicted problem, together with the most likely origin of the fault and the fault propagation path. Details of how to achieve this will be taken up in my next blog.
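To make the idea of a fault origin and propagation path concrete, the following Python sketch uses topology alone. It assumes a hypothetical dependency map (`depends_on`) and a set of currently alerting services, and applies a simple heuristic: an alerting service with no alerting dependency beneath it is a candidate origin. Real correlation across logs, traces and incidents would be considerably more involved.

```python
from collections import deque


def likely_origins(depends_on, alerting):
    """Alerting services that have no alerting dependency underneath them."""
    return sorted(
        svc for svc in alerting
        if not any(dep in alerting for dep in depends_on.get(svc, []))
    )


def propagation_order(depends_on, alerting, origin):
    """Breadth-first walk from the origin up through its alerting callers."""
    callers = {}
    for svc, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, []).append(svc)

    order, seen, queue = [], {origin}, deque([origin])
    while queue:
        svc = queue.popleft()
        order.append(svc)
        for caller in sorted(callers.get(svc, [])):
            if caller in alerting and caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return order


# Hypothetical topology: each service maps to the services it depends on.
topology = {"web": ["api"], "api": ["db", "cache"], "db": [], "cache": []}
alerting = {"web", "api", "db"}

origins = likely_origins(topology, alerting)            # ['db']
path = propagation_order(topology, alerting, "db")      # ['db', 'api', 'web']
```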
Another important point to note is that cross-stack correlation enables us to not only predict the future problem but also estimate its approximate arrival time (ETA). The insight on ETA is crucial for SREs as they can get an approximate idea of how much time they have to intercept and fix the problem. Existing AIOps tools fall short in this respect.
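One simple way to picture an ETA is to extrapolate a degrading metric to the point where it crosses a failure threshold. The Python sketch below does that with an ordinary least-squares line. This is a stand-in technique for illustration only, not the cross-stack correlation described above, and the function name, sample values and threshold are all assumptions.

```python
def eta_to_threshold(timestamps, values, threshold):
    """Seconds until a fitted linear trend reaches `threshold`, or None.

    Returns None when there are too few samples or the trend is flat or
    falling (the threshold is assumed to be an upper limit).
    """
    n = len(values)
    if n < 2:
        return None
    mean_t = sum(timestamps) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in timestamps)
    if var_t == 0:
        return None
    slope = sum(
        (t - mean_t) * (v - mean_v) for t, v in zip(timestamps, values)
    ) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    crossing_time = (threshold - intercept) / slope
    # A crossing in the past means the threshold is already breached.
    return max(0.0, crossing_time - timestamps[-1])


# Illustrative samples: memory usage (%) sampled once a minute.
eta = eta_to_threshold([0, 60, 120, 180], [50, 55, 60, 65], threshold=90)
# The trend rises 5 points per minute, so 90% is roughly 300 seconds away.
```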
Prevent (How to prevent a problem from affecting the end user?)
Once the problem is identified accurately, the AIOps tool should be able to proactively trigger healing actions and initiate auto-remediation workflows. These can include dynamically optimizing the workload, provisioning additional servers to handle the load, and so on. Such tasks should either be completely automated or, where that is not possible, be offered to the SREs as solution recommendations.
An important insight that can be provided by an AIOps tool to facilitate resolution is incident similarity. The AI engine can integrate with ITSM tools (like ServiceNow, Remedy etc.) to automatically provide a list of historical incidents similar to the predicted problem(s). This can help the SREs debug or triage the problem by looking at what actions were taken for a similar problem in the past. Another useful insight can be the impact radius or the set of services/instances impacted by the predicted problem.
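A minimal Python sketch of incident similarity is shown below, assuming incidents are available as plain-text descriptions. It ranks historical incidents by bag-of-words cosine similarity. The incident IDs and descriptions are made up, and no real ITSM API is called; in practice the history would be pulled from the ITSM tool and a stronger text representation would likely be used.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-cased word counts for a free-text incident description."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similar_incidents(query, history, top_k=3):
    """Return up to top_k (incident_id, score) pairs, most similar first."""
    q = tokens(query)
    scored = [(cosine(q, tokens(desc)), inc_id) for inc_id, desc in history.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(inc_id, round(score, 3)) for score, inc_id in scored[:top_k] if score > 0]


# Hypothetical incident history keyed by made-up incident IDs.
history = {
    "INC-1": "database connection pool exhausted on payments service",
    "INC-2": "disk full on log server",
    "INC-3": "payments service slow due to database lock contention",
}
matches = similar_incidents("payments database connection timeouts", history)
# INC-1 ranks first, INC-3 second; INC-2 shares no words and is dropped.
```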
With an even more advanced approach we can start suggesting automated healing actions or solution recommendations. More on that in my next blog.
Plan (Forecasting and managing future problems)
Although the previous three steps provide a good enough strategy for preventing problems in the near or immediate future, they cannot prevent problems that may crop up further ahead. Imagine a Black Friday sale where an e-commerce platform can anticipate up to 10x the normal traffic; or an airline expecting a spurt in ticket sales before the holiday travel season; or a bank looking forward to a significant increase in net-banking usage because of aggressive marketing. These kinds of scenarios cannot be handled using the predict/probe technique and require careful planning and forecasting.
This is part of the Plan strategy where we need to monitor the transactional data and predict the likely trends in the long term based on constraints. This requires a robust time series forecasting algorithm with excellent extrapolation capabilities as the transaction data in the future may not lie in the same range as training data. This forecast can be used to indicate future choke-points or bottlenecks in the system, reduce alerts over time and help SREs and ITOps in better resource planning.
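The need for extrapolation can be illustrated with a small Python sketch that fits an explicit linear trend plus a repeating seasonal offset. Because the trend is a fitted line, the forecast can rise above anything seen in training. This is only one simple technique, assumed here for illustration; the function names and the sample series are hypothetical.

```python
def fit_trend_seasonal(series, period):
    """Fit a least-squares line, then average the residuals per seasonal slot.

    Assumes len(series) >= 2 and len(series) >= period.
    """
    n = len(series)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    var_x = sum((x - mean_x) ** 2 for x in range(n))
    slope = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series)) / var_x
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in enumerate(series)]
    seasonal = [
        sum(residuals[p::period]) / len(residuals[p::period])
        for p in range(period)
    ]
    return slope, intercept, seasonal


def forecast(series, period, horizon):
    """Project the trend forward and add back the seasonal offset for each slot."""
    slope, intercept, seasonal = fit_trend_seasonal(series, period)
    n = len(series)
    return [
        intercept + slope * x + seasonal[x % period]
        for x in range(n, n + horizon)
    ]


# Illustrative daily transaction counts with steady growth.
history = [100, 130, 120, 150, 140, 170, 160, 190]
projected = forecast(history, period=2, horizon=4)
# The later projected values exceed the largest value in `history`.
```

Keeping the trend as an explicit term is what allows the forecast to leave the training range; a model that can only interpolate between previously seen values would not flag the future choke-point.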
At HEAL Software Inc. we are using the above 4P strategy successfully to provide actionable ML insights to our customers and to prevent outages and revenue loss.
In my next series of blogs I will zoom in on some of the ideas described in this blog and explain in more technical detail as well as share some of the experiences obtained from deploying these solutions. Please stay tuned.
About the author:
Atri Mandal is currently employed with HEAL Software India as Director, Machine Learning. He holds a Master's degree in Information and Computer Science from the University of California, Irvine (USA) and has over 18 years of experience in software development and research. He has worked on many interesting and complex research problems in the areas of AIOps, Machine Learning and NLP, and has 12+ peer-reviewed publications in top-tier conferences and journals and 11 disclosures (in different stages of filing). He also has experience in organizing and presenting at leading academic and business conferences and events.
In his current role he heads the Machine Learning team at HEAL and is involved in building the next generation AIOps product.