Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?

November 17, 2025
Authors: Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, Lingming Zhang
cs.AI

Abstract

Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and autonomously decide their next actions, forming complete trajectories that solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since exhausting the entire agent scaffold design space is extremely challenging and costly. Recognizing that software agents are themselves software that can be further refined or modified, researchers have recently proposed a number of self-improving software agents, including the Darwin-Gödel Machine (DGM). However, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on the fly at runtime while solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold, with access only to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that Live-SWE-agent achieves a solve rate of 75.4% without test-time scaling, outperforming all existing open-source software agents and approaching the performance of the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.
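
To make the idea of a bash-only scaffold that extends itself at runtime more concrete, here is a minimal sketch. It is not the authors' implementation: the class `LiveAgentSketch`, its methods, and the `find_def` tool are hypothetical names chosen for illustration, no LLM is involved, and the decision to create a tool is hard-coded where a real agent would let the model decide.

```python
import os
import shlex
import subprocess
import tempfile
from dataclasses import dataclass, field


@dataclass
class LiveAgentSketch:
    """Toy agent (hypothetical) whose only built-in action is running bash."""
    workdir: str
    tools: dict = field(default_factory=dict)  # tool name -> script path

    def run_bash(self, command: str) -> str:
        # The single primitive action: run a shell command inside workdir.
        result = subprocess.run(
            ["bash", "-c", command],
            cwd=self.workdir, capture_output=True, text=True, timeout=30,
        )
        return result.stdout + result.stderr

    def create_tool(self, name: str, script_body: str) -> None:
        # Runtime "evolution" in this sketch: write a new script to disk
        # and register it so later steps can call it by name.
        path = os.path.join(self.workdir, f"{name}.sh")
        with open(path, "w") as fh:
            fh.write(script_body + "\n")
        self.tools[name] = path

    def use_tool(self, name: str, *args: str) -> str:
        # Custom tools are still executed through the bash primitive.
        quoted_args = " ".join(shlex.quote(a) for a in args)
        return self.run_bash(f"bash {shlex.quote(self.tools[name])} {quoted_args}")


def demo() -> str:
    with tempfile.TemporaryDirectory() as workdir:
        agent = LiveAgentSketch(workdir)
        # Step 1: plain bash action that creates a small file to inspect.
        agent.run_bash("printf 'def f():\\n    return 1\\n' > mod.py")
        # Step 2: the agent adds a helper tool while working on the task.
        agent.create_tool("find_def", 'grep -n "def $1" "$2"')
        # Step 3: the new tool is used like any other action.
        return agent.use_tool("find_def", "f", "mod.py")
```

Calling `demo()` runs one plain bash action, registers one custom tool, and then invokes that tool. In an actual agent, each of these three steps would presumably be chosen by the LLM from the task context rather than scripted in advance.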