LLM-Powered Fully Automated Chaos Engineering: Towards Enabling Anyone to Build Resilient Software Systems at Low Cost
November 11, 2025
Authors: Daisuke Kikuta, Hiroki Ikeuchi, Kengo Tajiri
cs.AI
Abstract
Chaos Engineering (CE) is an engineering technique for improving the resilience of distributed systems. It intentionally injects faults into a system to test its resilience, uncover weaknesses, and fix them before they cause failures in production. Recent CE tools automate the execution of predefined CE experiments. However, planning those experiments and improving the system based on their results remain manual. These processes are labor-intensive and require multi-domain expertise. To address these challenges and enable anyone to build resilient systems at low cost, this paper proposes ChaosEater, a system that automates the entire CE cycle with Large Language Models (LLMs). It predefines an agentic workflow according to a systematic CE cycle and assigns subdivided processes within the workflow to LLMs. ChaosEater targets CE for software systems built on Kubernetes, so its LLMs complete CE cycles through software engineering tasks, including requirement definition, code generation, testing, and debugging. We evaluate ChaosEater through case studies on small- and large-scale Kubernetes systems. The results demonstrate that it consistently completes reasonable CE cycles at low time and monetary cost. Its cycles are also qualitatively validated by human engineers and LLMs.
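As a rough illustration of what "predefining an agentic workflow and assigning subdivided processes to LLMs" could look like, the sketch below runs a fixed sequence of phases and passes each one to an LLM callable together with the outputs of the earlier phases. It is not ChaosEater's implementation. The phase names (`hypothesis`, `experiment`, `analysis`, `improvement`), the `CycleState` type, the `run_cycle` function, and the `fake_llm` stub are all assumptions introduced for illustration; the abstract states only that the workflow follows a systematic CE cycle.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for an LLM call: takes a prompt, returns text.
LLM = Callable[[str], str]

# Assumed phase names; the abstract does not enumerate the phases.
PHASES = ["hypothesis", "experiment", "analysis", "improvement"]


@dataclass
class CycleState:
    manifests: str  # Kubernetes manifests of the system under test
    notes: List[str] = field(default_factory=list)  # one entry per completed phase


def run_cycle(manifests: str, llm: LLM) -> CycleState:
    """Run the fixed phase sequence once, one LLM call per phase."""
    state = CycleState(manifests=manifests)
    for phase in PHASES:
        # Each phase sees the manifests plus everything earlier phases produced.
        prompt = (
            f"[{phase}] manifests:\n{state.manifests}\n"
            "prior notes:\n" + "\n".join(state.notes)
        )
        state.notes.append(f"{phase}: {llm(prompt)}")
    return state


def fake_llm(prompt: str) -> str:
    # Deterministic stub so the sketch runs offline; a real system would
    # call a model here and parse structured output.
    phase = prompt.split("]")[0].lstrip("[")
    return f"stub output for {phase}"


if __name__ == "__main__":
    result = run_cycle("kind: Deployment", fake_llm)
    for note in result.notes:
        print(note)
```

Keeping the phase order fixed in code and confining the model to one phase at a time is one way to read "predefined workflow". A real system would also need to execute the generated experiments against a cluster and feed the results back into the analysis phase, which this sketch leaves out.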