

SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

December 20, 2025
Authors: Minh V. T. Thai, Tue Le, Dung Nguyen Manh, Huy Phan Nhat, Nghi D. Q. Bui
cs.AI

Abstract

Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or implementing a small feature. However, real-world software engineering is fundamentally a long-horizon endeavor: developers must interpret high-level requirements, plan coordinated changes across many files, and evolve codebases over multiple iterations while preserving existing functionality. We introduce SWE-EVO, a benchmark that evaluates agents on this long-horizon software evolution challenge. Constructed from release notes and version histories of seven mature open-source Python projects, SWE-EVO comprises 48 evolution tasks that require agents to implement multi-step modifications spanning an average of 21 files, validated against comprehensive test suites averaging 874 tests per instance. Experiments with state-of-the-art models reveal a striking capability gap: even GPT-5 with OpenHands achieves only a 21 percent resolution rate on SWE-EVO, compared to 65 percent on the single-issue SWE-Bench Verified. This demonstrates that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a fine-grained metric that captures partial progress toward solving these complex, long-horizon tasks.
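The abstract does not spell out how Fix Rate is computed. As a purely illustrative sketch (an assumption, not the authors' definition), one natural way to quantify partial progress on an instance is the fraction of that instance's reference tests that pass after the agent's patch is applied:

```python
def fix_rate(test_results: dict[str, bool]) -> float:
    """Illustrative partial-progress metric (assumed form, not the paper's exact definition).

    test_results maps each test in an instance's reference suite
    (SWE-EVO instances average ~874 tests) to whether it passes
    after applying the agent's patch.
    """
    if not test_results:
        return 0.0
    passed = sum(1 for ok in test_results.values() if ok)
    return passed / len(test_results)


# Hypothetical usage: an instance where 600 of 874 reference tests pass.
example = {f"test_{i}": i < 600 for i in range(874)}
print(f"Fix Rate: {fix_rate(example):.2%}")  # -> Fix Rate: 68.65%
```

Under this reading, a binary resolution rate would count an instance as solved only when every reference test passes, while Fix Rate rewards patches that get part of the way there.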