Effective Strategies for Asynchronous Software Engineering Agents

Abstract

AI agents have become increasingly capable at isolated software engineering (SWE) tasks such as resolving issues on GitHub. Yet long-horizon tasks involving multiple interdependent subtasks still pose challenges for both accuracy and timely completion. A natural approach to solving these long-horizon tasks in a timely manner is asynchronous multi-agent collaboration, in which multiple agents work on different parts of the task at the same time. But applying multi-agent systems effectively has proven surprisingly difficult: concurrent edits by multiple agents interfere with each other, dependencies are difficult to synchronize, and combining partial progress into a coherent whole is challenging. Human developers, by contrast, have long relied on mature collaboration infrastructure to manage these challenges in large software projects. Inspired by these collaboration primitives, we introduce Centralized Asynchronous Isolated Delegation (CAID), a structured multi-agent coordination paradigm grounded in three core SWE primitives: centralized task delegation, asynchronous execution, and isolated workspaces. CAID constructs dependency-aware task plans through a central manager, executes subtasks concurrently in isolated workspaces, and consolidates progress via structured integration with executable test-based verification. In empirical evaluation, CAID improves accuracy over single-agent baselines by 26.7% absolute on paper reproduction tasks (PaperBench) and by 14.3% on Python library development tasks (Commit0). Through systematic analysis, we find that branch-and-merge is a central coordination mechanism for multi-agent collaboration, and that SWE primitives such as git worktree, git commit, and git merge allow it to be realized in a reliable and executable manner.
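To make the three primitives concrete, the following is a minimal sketch of dependency-aware delegation with concurrent execution in isolated workspaces. It is an illustration under stated assumptions, not the authors' implementation: the names `run_subtask`, `run_plan`, and `main` are hypothetical, plain temporary directories stand in for workspaces that would be created with git worktree, and a file-existence check stands in for executable test-based verification before integration.

```python
import asyncio
import tempfile
from graphlib import TopologicalSorter
from pathlib import Path


async def run_subtask(name: str, workspace: Path) -> Path:
    """Hypothetical engineer agent: writes only inside its own workspace."""
    await asyncio.sleep(0)  # placeholder for real agent work
    output = workspace / f"{name}.txt"
    output.write_text(f"result of {name}\n")
    return output


async def run_plan(deps: dict[str, set[str]], root: Path) -> list[str]:
    """Run subtasks concurrently, respecting dependencies.

    `deps` maps each subtask to the set of subtasks it depends on.
    Returns the order in which subtasks were verified and integrated.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the plan has a cycle
    integrated: list[str] = []
    running: dict[asyncio.Task, str] = {}

    while sorter.is_active():
        # Delegate every subtask whose dependencies are already integrated.
        for name in sorter.get_ready():
            workspace = root / name  # assumption: stands in for a git worktree
            workspace.mkdir()
            running[asyncio.create_task(run_subtask(name, workspace))] = name

        # Wait until at least one running subtask finishes.
        done, _ = await asyncio.wait(running, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            name = running.pop(task)
            output = task.result()
            # Assumption: stands in for running tests before merging a branch.
            if not output.exists():
                raise RuntimeError(f"verification failed for {name}")
            integrated.append(name)
            sorter.done(name)  # unblocks subtasks that depend on this one

    return integrated


def main() -> list[str]:
    # Hypothetical plan: "train" needs "data" and "model"; "eval" needs "train".
    plan = {"data": set(), "model": set(), "train": {"data", "model"}, "eval": {"train"}}
    with tempfile.TemporaryDirectory() as tmp:
        return asyncio.run(run_plan(plan, Path(tmp)))


if __name__ == "__main__":
    print(main())
```

In this sketch, `data` and `model` have no dependencies and run at the same time, while `train` is delegated only after both have been integrated, and `eval` only after `train`. Marking a subtask as done only after verification is the design choice that keeps dependent work from starting on unverified results.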