Reducing Incident Mean Time to Resolution Using Elasticsearch and Large Language Models
© 2025 by IJCTT Journal
Volume-73 Issue-3
Year of Publication: 2025
Authors: Govind Singh Rawat
DOI: 10.14445/22312803/IJCTT-V73I3P116
How to Cite?
Govind Singh Rawat, "Reducing Incident Mean Time to Resolution Using Elasticsearch and Large Language Models," International Journal of Computer Trends and Technology, vol. 73, no. 3, pp. 125-132, 2025. Crossref, https://doi.org/10.14445/22312803/IJCTT-V73I3P116
Abstract
Faster incident resolution is a key focus of modern Site Reliability Engineering (SRE). This paper presents a system that combines Elasticsearch (ES) with Large Language Models (LLMs) to reduce Mean Time to Resolution (MTTR). The system embeds historical alarm data, extracts essential features, and uses k-nearest neighbors (kNN) search to link a new alarm to similar past incidents and retrieve their resolutions, which responders can then explore through interaction with an LLM. The resulting continuous feedback loop speeds up incident response and resolution.
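The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the incident records, the three-dimensional embeddings, the field name `alarm_embedding`, and the helper names are all hypothetical, and the brute-force cosine search only stands in for what Elasticsearch's kNN search would do against a `dense_vector` field. The `build_knn_query` helper shows the general shape of a kNN search request body (`field`, `query_vector`, `k`, `num_candidates`) under those assumptions.

```python
import math

# Toy stand-in for an index of past incidents. The ids, embeddings, and
# resolutions are invented for illustration; real embeddings would come
# from an embedding model and have hundreds of dimensions.
PAST_INCIDENTS = [
    {"id": "INC-1", "embedding": [0.9, 0.1, 0.0],
     "resolution": "Restart the ingest workers"},
    {"id": "INC-2", "embedding": [0.1, 0.9, 0.1],
     "resolution": "Increase the DB connection pool"},
    {"id": "INC-3", "embedding": [0.8, 0.2, 0.1],
     "resolution": "Roll back the last deploy"},
]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_incidents(query_embedding, incidents, k=2):
    """Brute-force kNN: return the k incidents most similar to the query."""
    ranked = sorted(
        incidents,
        key=lambda inc: cosine_similarity(query_embedding, inc["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_knn_query(query_embedding, k=2, num_candidates=50,
                    field="alarm_embedding"):
    """Request body for an Elasticsearch kNN search.

    The field name is an assumption; it must match a dense_vector field
    in whatever index mapping is actually used.
    """
    return {
        "knn": {
            "field": field,
            "query_vector": list(query_embedding),
            "k": k,
            "num_candidates": num_candidates,
        }
    }


def build_prompt(alarm_text, neighbors):
    """Assemble an LLM prompt from the new alarm and retrieved resolutions."""
    lines = [f"New alarm: {alarm_text}", "Similar past incidents:"]
    for inc in neighbors:
        lines.append(f"- {inc['id']}: {inc['resolution']}")
    lines.append("Suggest the most likely mitigation steps.")
    return "\n".join(lines)


if __name__ == "__main__":
    new_alarm_embedding = [1.0, 0.0, 0.0]  # hypothetical embedding of a new alarm
    neighbors = nearest_incidents(new_alarm_embedding, PAST_INCIDENTS, k=2)
    print(build_prompt("ingest queue depth above threshold", neighbors))
    print(build_knn_query(new_alarm_embedding, k=2))
```

In a real deployment the brute-force search would be replaced by sending the request body to an Elasticsearch index, and the assembled prompt would be passed to whichever LLM the team uses.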
Keywords
Elasticsearch (ES), Incident mitigation, k-nearest neighbors search, Mean Time to Resolution, Site Reliability Engineering.