Effective Strategies for Asynchronous Software Engineering Agents

Abstract

AI agents have become increasingly capable at isolated software engineering (SWE) tasks such as resolving issues on GitHub. Yet long-horizon tasks involving multiple interdependent subtasks still pose challenges in both accuracy and timely completion. A natural approach to solving these long-horizon tasks in a timely manner is asynchronous multi-agent collaboration, where multiple agents work on different parts of the task at the same time. But applying multi-agent systems effectively has proven surprisingly difficult: concurrent edits by multiple agents interfere with each other, dependencies are difficult to synchronize, and combining partial progress into a coherent whole is challenging. Human developers, by contrast, have long relied on mature collaboration infrastructure to manage these challenges in large software projects. Inspired by these collaboration primitives, we introduce Centralized Asynchronous Isolated Delegation (CAID), a structured multi-agent coordination paradigm grounded in three core SWE primitives: centralized task delegation, asynchronous execution, and isolated workspaces. CAID constructs dependency-aware task plans through a central manager, executes subtasks concurrently in isolated workspaces, and consolidates progress via structured integration with executable test-based verification. In empirical evaluation, we find that CAID improves accuracy over single-agent baselines by 26.7% absolute on paper reproduction tasks (PaperBench) and 14.3% on Python library development tasks (Commit0). Through systematic analysis, we find that branch-and-merge is a central coordination mechanism for multi-agent collaboration, and that SWE primitives such as git worktree, git commit, and git merge enable it to be realized in a reliable and executable manner.
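The coordination pattern described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`plan_waves`, `git_commands`, `run_subtask`, `run_all`), the branch and path naming scheme, and the example task graph are all hypothetical, and the agent itself is replaced by a placeholder coroutine. The sketch only shows how a central manager might order interdependent subtasks, run independent ones concurrently, and list the git commands that would give each subtask an isolated worktree before merging its branch.

```python
import asyncio
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan_waves(deps):
    """Group subtasks into waves; each task's dependencies sit in earlier waves.

    `deps` maps a task name to the list of tasks it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


def git_commands(task):
    """Return the git commands (as argv lists) for one subtask.

    The branch name and worktree path are assumed naming conventions.
    Nothing is executed here; the commands are only constructed.
    """
    branch = f"agent/{task}"
    path = f"../wt-{task}"
    return [
        ["git", "worktree", "add", "-b", branch, path],  # isolated workspace
        ["git", "-C", path, "add", "-A"],                # stage the agent's edits
        ["git", "-C", path, "commit", "-m", f"{task}: agent changes"],
        ["git", "merge", "--no-ff", branch],             # integrate into main line
    ]


async def run_subtask(task, log):
    """Placeholder for an agent working inside its own worktree."""
    await asyncio.sleep(0)
    log.append(task)


async def run_all(deps):
    """Run each wave concurrently, and the waves in dependency order."""
    log = []
    for wave in plan_waves(deps):
        await asyncio.gather(*(run_subtask(task, log) for task in wave))
        # A real manager would run the test suite here before each merge.
    return log


if __name__ == "__main__":
    example = {"model": [], "data": [], "train": ["model", "data"], "eval": ["train"]}
    print(plan_waves(example))
    print(asyncio.run(run_all(example)))
```

In this sketch the manager merges only after a whole wave finishes, which is one simple way to keep dependent subtasks from starting on stale code; merging each branch as soon as its tests pass would be an equally plausible design.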
