Effective Strategies for Asynchronous Software Engineering Agents

Abstract

AI agents have become increasingly capable at isolated software engineering (SWE) tasks such as resolving issues on GitHub. Yet long-horizon tasks involving multiple interdependent subtasks still pose challenges in both accuracy and timely completion. A natural approach to solving these long-horizon tasks in a timely manner is asynchronous multi-agent collaboration, where multiple agents work on different parts of the task at the same time. But applying multi-agent systems effectively has proven surprisingly difficult: concurrent edits by multiple agents interfere with each other, dependencies are difficult to synchronize, and combining partial progress into a coherent whole is challenging. Human developers, by contrast, have long relied on mature collaboration infrastructure to manage these challenges in large software projects. Inspired by these collaboration primitives, we introduce Centralized Asynchronous Isolated Delegation (CAID), a structured multi-agent coordination paradigm grounded in three core SWE primitives: centralized task delegation, asynchronous execution, and isolated workspaces. CAID constructs dependency-aware task plans through a central manager, executes subtasks concurrently in isolated workspaces, and consolidates progress via structured integration with executable test-based verification. In empirical evaluation, we find that CAID improves accuracy over single-agent baselines by 26.7% absolute on paper reproduction tasks (PaperBench) and 14.3% on Python library development tasks (Commit0). Through systematic analysis, we find that branch-and-merge is a central coordination mechanism for multi-agent collaboration, and that SWE primitives such as git worktree, git commit, and git merge enable it to be realized in a reliable and executable manner.
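To make the idea of a dependency-aware task plan with concurrent execution concrete, the following is a minimal sketch and not the CAID implementation. It assumes a hypothetical plan (`PLAN`) that maps each subtask to the subtasks it depends on, and a stand-in `run_subtask` coroutine in place of a real agent. It illustrates only the scheduling aspect: workspace isolation (for example via git worktree) and test-based merging are deliberately left out.

```python
import asyncio

# Hypothetical plan for illustration: subtask name -> subtasks it depends on.
PLAN = {
    "parse_config": [],
    "data_loader": ["parse_config"],
    "model": ["parse_config"],
    "train_loop": ["data_loader", "model"],
}


async def run_subtask(name, log):
    """Stand-in for an agent working on one subtask (assumption, not a real agent)."""
    log.append(("start", name))
    await asyncio.sleep(0.01)  # simulated work
    log.append(("done", name))


async def run_plan(plan):
    """Start every subtask as soon as all of its dependencies have finished."""
    log = []
    done = {name: asyncio.Event() for name in plan}

    async def worker(name):
        # Block until every dependency has signalled completion.
        for dep in plan[name]:
            await done[dep].wait()
        await run_subtask(name, log)
        done[name].set()

    # All workers are launched at once; the events enforce the ordering.
    await asyncio.gather(*(worker(name) for name in plan))
    return log


def execution_log(plan=PLAN):
    """Run the plan and return the ordered list of (event, subtask) pairs."""
    return asyncio.run(run_plan(plan))
```

In this sketch, `data_loader` and `model` run concurrently once `parse_config` completes, while `train_loop` waits for both. A central manager in the sense described above would additionally assign each worker its own workspace and integrate the results after verification.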
