Reimagining a New Era of Incident Response as LLMs Advance
- By Peter Marelas, New Relic
- May 13, 2024
As Asia Pacific organizations continue to accelerate their digitalization, they face tremendous pressure to keep everything running in an increasingly complex IT environment. The stakes are arguably higher in the region than anywhere else. New Relic's 2023 Observability Forecast found that Asia Pacific had the highest median annual outage cost by far: more than double the figure in Europe and nearly 16x that of North America.
Their IT teams are not only saddled with the responsibility to find and fix incidents as quickly as possible but also need to prevent those costly incidents from occurring again. Naturally, many IT leaders in the region are watching the emergence of AI and the evolution of large language models (LLMs) and their potential to change incident response as we know it.
Prevention is the North Star of incident response with AI, but experience matters
Many teams are already beginning to see how AIOps technology can help minimize issues or their impact on customer experience, for example through proactive anomaly detection, incident correlation to reduce alert noise, and automated probable root-cause analysis.
The promise of AI in minimizing IT incidents appears infinite, with some going as far as saying it will eventually achieve the goal of preventing disruptions and outages altogether. However, skipping any fundamental steps in that journey or limiting the experience of IT teams working through incident responses today could prove detrimental to the advancement of LLMs.
For many IT teams today, detecting potential problems before they turn into incidents still takes too much time. Teams often work reactively, firefighting incidents while never finding time to implement processes that allow them to identify issues before they cause disruptions.
To master prevention with the support of LLMs, teams need to live through finding and fixing incidents. This step cannot be skipped: it is the experience gained from finding and fixing incidents that builds the skills to implement mitigation strategies and take preventative measures. That experience will enrich both the human teams and the capability of LLMs to understand and rationalize extensive data sets and accomplish the varied array of tasks within the incident response life cycle.
Three ways LLMs will transform incident response
The incident response life cycle can vary from organization to organization and even team to team. Here are some of the possibilities within critical tasks across the incident response life cycle:
- Research: When an incident occurs, an engineer's first step is to gather information and research the problem space. LLMs have a significant role to play in this process. With access to current and historical data, LLMs will be capable of analyzing the incident, searching past incidents to draw on past experiences, and reasoning over this data to recommend a potential path forward. With LLMs taking on the role of the researcher, SRE teams will save a significant number of manual hours.
- Troubleshooting & Diagnosis: As LLMs evolve, teams can draw on the same research function using broader knowledge bases to help investigate an incident, including identifying run-books applicable to an incident. As the knowledge base extends beyond the organization to external knowledge, AI agents can perform automated root cause analysis through iterative evaluation of hypotheses that draw on local experiences and world knowledge. They will be able to mimic human cognition and perform reasoning and actions through dialogue with human teams to fill in gaps from earlier stages, then assist by making suggestions. The value to engineering lies in a shorter mean-time-to-understanding of the impact and cause of incidents, while the value to the business lies in a shorter mean-time-to-resolution.
- Incident Postmortems & Documentation: Engineers collect, summarize, and produce a postmortem after an incident. An incident postmortem involves dissecting failures to gain insights into why they occurred, how they impacted operations, and, most importantly, how to prevent them in the future. This process can take weeks. Through search, summarization and reasoning abilities, LLMs can facilitate the initial stages of creating a post-incident review by collecting, collating, summarizing, and analyzing the data, then making recommendations relating to mitigation strategies, reducing the cognitive load on engineers and saving them a significant amount of time.
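To make the idea of "searching past incidents" in the Research task more concrete, here is a minimal Python sketch. It is an illustration only, not how New Relic or any LLM product works: it swaps in plain word-overlap (Jaccard) scoring where a real system would more likely use an LLM or embedding-based search. The `Incident` record, the `HISTORY` data, and the `find_similar` helper are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical record shape; real incident stores will differ.
    incident_id: str
    summary: str
    resolution: str


# Made-up history, for illustration only.
HISTORY = [
    Incident("INC-1", "checkout api latency spike after database failover",
             "Restarted stale connections and tuned failover timeouts"),
    Incident("INC-2", "login page returns 500 errors after certificate expiry",
             "Renewed certificate and added an expiry alert"),
    Incident("INC-3", "database connection pool exhausted on checkout api",
             "Raised pool size and fixed a connection leak"),
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of tokens the two sets have in common (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_similar(new_summary: str, history: list[Incident], top_k: int = 2):
    """Return up to top_k (score, incident) pairs, most similar first.

    Incidents that share no words with the new summary are dropped.
    """
    query = tokenize(new_summary)
    scored = [(jaccard(query, tokenize(inc.summary)), inc) for inc in history]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    new_incident = "checkout api latency spike and database timeouts"
    for score, inc in find_similar(new_incident, HISTORY):
        print(f"{inc.incident_id} ({score:.2f}): {inc.resolution}")
```

Even in this toy form, the sketch reflects the article's central point: the search can only surface what teams have already lived through and written down, so the quality of the documented history bounds the quality of the recommendations.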
As LLMs become more sophisticated, organizations and their IT teams can certainly look forward to improvements in how incidents are managed and eventually prevented. The caveat is that there are no shortcuts in the process and, more importantly, there is no substitute for the lived experience of human teams.
LLMs require human teams to have a wealth of lived and documented incident response experience to perform tasks based on logical reasoning effectively. Only then will the tools produce the anticipated positive impact on incident response times, resolution times, and overall outcomes. The next chapter of incident response will be powered by greater efficiency in how organizations respond to, manage, and learn from incidents, underscored by intelligence, automation, and human-machine collaboration.
The views and opinions expressed in this article are those of the author and do not necessarily reflect those of CDOTrends. Image credit: iStockphoto/ Yuriy Altukhov
Peter Marelas is the chief architect for APJ at New Relic.