What is AIOps?
Artificial Intelligence for IT Operations, or AIOps, is a term coined by Gartner to describe a class of technology that uses machine learning and big data to enhance IT operations. It works by ingesting data points from devices throughout your network and analyzing them with machine learning models that have been trained to recognize various use cases.
Harnessing the power of artificial intelligence, AIOps can correlate information between your devices to not only find issues in your network but also make predictions about potential issues before they even occur.
In this blog, we’ll explore what AIOps is, as well as the five components that make up the system. We’ll also look at practical examples of how AIOps is being used today and what to look for when searching for a solution.
Why Do We Need AIOps?
Anyone who has worked in IT knows that triaging and finding issues can be extremely difficult and time-consuming. Complaints from users can often be vague and require a long series of troubleshooting steps that differ depending on the use case.
Each network problem brings its own unique series of questions that must be chased down and answered by operations. Oftentimes, these triaging steps feel more like a time-consuming process of elimination.
- Is this affecting only one user, or are there several others?
- Is that application up and performing well for most users?
- Are we seeing any errors or warnings in our log files?
- How’s the link performance?
- Have we had any recent downtime?
Ideally, IT would have the answers to these questions in real time so they can make informed decisions, not only for this use case but for the others they commonly see in the field. Current IT monitoring tools will notify you of obvious problems, but their decision-making is based solely on whether a threshold was met or an error occurred; they lack the intelligence and contextual awareness that a human brings to complex problems.
Intro to AIOps
AIOps aims to solve these problems by using Artificial Intelligence. Event and telemetry data from the devices on the network are run through many different scenarios and models to provide the answers to those triaging steps in real time.
There are several different models by which a system can learn and make decisions. In AIOps, the learning method is typically the supervised model, which means the system has to be trained on what to look for and how to score the various data points that it receives. In IT operations, there are countless use cases that our system has to be trained on so it can effectively find and remediate issues on our network.
While machine learning is the core of our intelligence, we still need to ingest and normalize thousands of different data points on our network, and that takes us to the second major part of our AIOps system: big data. A typical network produces many data points in the form of syslog, NetFlow, config changes, and more. This leads to a massive amount of data that must be ingested, stored, and made retrievable.
At its core, this is what AIOps is looking to achieve: turn hundreds or thousands of data points into information that can be used to make a decision in real-time. Ideally, this would lead to not just finding and remediating issues quicker but also finding issues before they are noticed by the user.
Gartner defines five major functions of an AIOps system that we should review: Ingestion, Topology, Correlation, Recognition, and Remediation.
The big data component of the system will ingest, index, and normalize events from devices on your network. This spans multiple devices and vendors, grabbing data and telemetry from everything in your environment. These events can be config changes, syslog messages, SNMP alerts, NetFlow records, or other types of telemetry data. As you evaluate AIOps solutions, this step is critical to consider: make sure your devices can be integrated and supported. The more data points supported for your devices, the better; a system that supports syslog, SNMP, and NetFlow for your device will have better context than one that only supports syslog.
Gartner also calls out two capabilities the ingestion function must provide: real-time analysis and historical analysis of stored data. Both are critical to the other components of an AIOps system.
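To make the normalization step concrete, here is a minimal sketch of mapping two different event sources into one common schema. The field names and schema are illustrative assumptions, not taken from any specific AIOps product.

```python
def normalize_syslog(raw: dict) -> dict:
    """Map a raw syslog message into a common event schema (fields are illustrative)."""
    return {
        "source": raw["host"],
        "type": "syslog",
        "severity": raw.get("severity", "info"),
        "timestamp": raw["timestamp"],
        "detail": raw["message"],
    }

def normalize_netflow(raw: dict) -> dict:
    """Map a NetFlow record into the same common schema."""
    return {
        "source": raw["exporter"],
        "type": "netflow",
        "severity": "info",
        "timestamp": raw["flow_start"],
        "detail": f"{raw['src_ip']} -> {raw['dst_ip']} ({raw['bytes']} bytes)",
    }

# Every event lands in the same shape, so later stages
# (correlation, recognition) can treat them uniformly.
events = [
    normalize_syslog({"host": "sw-core-1", "severity": "warning",
                      "timestamp": "2024-01-01T10:00:00Z",
                      "message": "Interface Gi1/0/1 down"}),
    normalize_netflow({"exporter": "rtr-edge-1",
                       "flow_start": "2024-01-01T10:00:02Z",
                       "src_ip": "10.0.0.5", "dst_ip": "10.0.1.9",
                       "bytes": 4096}),
]
```

Real ingestion pipelines must also parse wire formats (RFC 5424 syslog, NetFlow v9/IPFIX), but the key idea is the same: heterogeneous telemetry converges on one indexed, queryable shape.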
The topology function relates to the discovery and mapping of IT assets, including hardware and software in the environment. This goes beyond just knowing about the device but extends to also building relationships between the devices. The same is true for a human that begins to troubleshoot a potential user issue. Having a topology view helps understand the context of potential issues between the user and the resource they are trying to access.
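The device relationships the topology function builds can be represented as a simple graph; finding the devices between a user and a resource is then just a path search. A minimal sketch with an assumed, hypothetical topology:

```python
from collections import deque

# Illustrative topology: adjacency list of discovered devices (hypothetical names).
topology = {
    "user-laptop": ["access-sw-1"],
    "access-sw-1": ["dist-sw-1"],
    "dist-sw-1": ["core-rtr-1"],
    "core-rtr-1": ["firewall-1"],
    "firewall-1": ["app-server-1"],
    "app-server-1": [],
}

def path_between(topo, src, dst):
    """Breadth-first search: which devices sit between a user and a resource?
    These are the devices worth examining when that user reports an issue."""
    seen = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = seen[node]
            return list(reversed(path))
        for nbr in topo.get(node, []):
            if nbr not in seen:
                seen[nbr] = node
                queue.append(nbr)
    return None

print(path_between(topology, "user-laptop", "app-server-1"))
# → ['user-laptop', 'access-sw-1', 'dist-sw-1', 'core-rtr-1', 'firewall-1', 'app-server-1']
```

A real platform discovers this graph automatically (CDP/LLDP, routing tables, flow data), but the troubleshooting payoff is the same: an instant answer to "what sits between this user and this resource?"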
With data coming in and relationships between devices established, the next function is to correlate the telemetry data between devices. That means the system understands how the various assets on your network relate to each other and to the network at large.
For example, an application used by 3rd-party contractors might seem to have no relationship with an end user who works in sales. But if that salesperson depends on it to access invoices, then for that business group, correlating the information between all systems involved in that flow is crucial. This includes the endpoints, switches along the path, routers, firewalls, and servers that could all be problematic and require further investigation.
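One simple way to correlate, sketched below under assumed field names: given the devices on the affected path and the time a problem was reported, pull only the events that match both the path and a time window around the report.

```python
from datetime import datetime, timedelta

def parse(ts: str) -> datetime:
    # Accept the trailing "Z" suffix on older Python versions.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def correlate(events, path_devices, anchor_time, window_secs=60):
    """Keep events from devices on the affected path that occurred
    within a time window of the reported problem."""
    anchor = parse(anchor_time)
    window = timedelta(seconds=window_secs)
    return [
        e for e in events
        if e["source"] in path_devices
        and abs(parse(e["timestamp"]) - anchor) <= window
    ]

events = [
    {"source": "firewall-1", "timestamp": "2024-01-01T10:00:10Z",
     "detail": "Denied tcp 10.0.0.5 -> 10.0.1.9:443"},
    {"source": "branch-sw-9", "timestamp": "2024-01-01T10:00:15Z",
     "detail": "Unrelated port flap"},           # not on the path
    {"source": "core-rtr-1", "timestamp": "2024-01-01T09:00:00Z",
     "detail": "Old event"},                     # outside the window
]
path = {"user-laptop", "core-rtr-1", "firewall-1", "app-server-1"}
suspects = correlate(events, path, "2024-01-01T10:00:00Z")
# Only the firewall deny survives both filters.
```

Production correlation engines weight many more signals than path and time, but even this two-filter version shows how correlation shrinks a flood of events down to a handful of suspects.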
Once we have our data, understand the relationships between the devices in our network, and have correlated the data from the devices involved in a given use case, the next function is what Gartner calls Recognition. This is where issues are detected or predicted based on the machine learning training that has been done on the system.
This will undoubtedly be the most important component of the AIOps platform and will vary greatly depending on the vendor. In a supervised learning model, the system is trained on specific use cases to look out for. That means the system is only as good as the models it has been trained on. Because each vendor trains its own models and use cases, each product can vary greatly from the next. And as this field evolves before our eyes, you’ll notice that some platforms are cloud-heavy while others focus on traditional network routing. You’ll need a firm understanding of what you are looking for in an AIOps platform, and should try it in a POC to make sure it works for your environment.
Ultimately, the recognition function, just like any machine learning model, is about making a decision based on the data and training it has received.
Based on the decision or prediction that the system has made, the next phase is to actually do something. This is where the remediation phase either makes a recommendation based on the situation or automates a response to an external system. It’s unlikely that most customers will opt for full automation; a more likely scenario is that a trigger alerts the network team to the findings so a human can make the final decision on the appropriate next steps.
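That recommend-versus-automate split can be sketched as a simple confidence gate. The threshold, fields, and actions here are assumptions for illustration only:

```python
def remediate(finding: dict, auto_threshold: float = 0.95) -> dict:
    """Decide between auto-remediation and a human-in-the-loop alert.
    Threshold and action format are illustrative assumptions."""
    if finding["confidence"] >= auto_threshold and finding.get("action"):
        # High confidence and a known fix: hand off to an automation system.
        return {"mode": "automated", "do": finding["action"]}
    # Otherwise, surface the finding and let a human decide.
    return {
        "mode": "alert",
        "message": (f"Suspected {finding['issue']} on {finding['device']} "
                    f"(confidence {finding['confidence']:.0%}); review recommended."),
    }

result = remediate({"issue": "duplex mismatch", "device": "access-sw-1",
                    "confidence": 0.80,
                    "action": "set interface duplex auto"})
print(result["mode"])  # → alert  (below threshold, a human makes the call)
```

In practice teams usually start with everything in alert mode and only promote individual, well-understood fixes to full automation once they trust the system's track record.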
AIOps is not a “set it and forget it” model. On the contrary, it’s a system that relies heavily on learning and improving over time. This means more potential for false positives in the beginning and, ideally, more accurate decisions and predictions as the system is trained over time via machine learning.
Market and Considerations
The AIOps market is still incredibly new, and most vendors are just now beginning to scratch the surface of what’s possible with Artificial Intelligence. Over time, we should expect to see more and more use cases covered as vendors introduce new training into their machine learning models. As you consider the landscape of possible solutions, here are some questions you should answer:
- Does this solution ingest data from all my major IT assets? If so, what event types are supported?
- How well does this solution integrate with my current processes?
- In other words, how can this speed up my current triaging instead of being another thing that IT folks have to look at when troubleshooting?
- What is my use case?
- What use cases are covered in the machine learning training?
- What kind of proactive predictions can be expected?