Effective Strategies for Asynchronous Software Engineering Agents

Abstract

AI agents have become increasingly capable at isolated software engineering (SWE) tasks such as resolving issues on GitHub. Yet long-horizon tasks involving multiple interdependent subtasks still pose challenges for both accuracy and timely completion. A natural approach to solving these long-horizon tasks quickly is asynchronous multi-agent collaboration, where multiple agents work on different parts of the task at the same time. But applying multi-agent systems effectively has proven surprisingly difficult: concurrent edits by multiple agents interfere with each other, dependencies are hard to synchronize, and combining partial progress into a coherent whole is challenging. Human developers, by contrast, have long relied on mature collaboration infrastructure to manage these challenges in large software projects. Inspired by these collaboration primitives, we introduce Centralized Asynchronous Isolated Delegation (CAID), a structured multi-agent coordination paradigm grounded in three core SWE primitives: centralized task delegation, asynchronous execution, and isolated workspaces. CAID constructs dependency-aware task plans through a central manager, executes subtasks concurrently in isolated workspaces, and consolidates progress via structured integration with executable test-based verification. In empirical evaluation, we find that CAID improves accuracy over single-agent baselines by 26.7% absolute on paper reproduction tasks (PaperBench) and 14.3% on Python library development tasks (Commit0). Through systematic analysis, we find that branch-and-merge is a central coordination mechanism for multi-agent collaboration, and that SWE primitives such as git worktree, git commit, and git merge enable it to be realized in a reliable and executable manner.
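The dependency-aware, asynchronous delegation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `run_plan`, `demo_worker`, and the `PLAN` subtask names are hypothetical, and the worker only sleeps where a real agent would edit code in its own isolated workspace (for example a separate git worktree) before the manager merges and tests the result. The sketch shows only the scheduling idea: a subtask starts as soon as all of its dependencies have finished, and independent subtasks run concurrently.

```python
import asyncio
from graphlib import TopologicalSorter


async def run_plan(deps, worker):
    """Run subtasks concurrently, starting each once its dependencies are done.

    deps maps a subtask name to the set of subtask names it depends on.
    worker is an async callable standing in for one agent (an assumption).
    Returns the subtask names in the order they finished.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the plan is cyclic
    finished = []
    running = {}  # asyncio.Task -> subtask name
    while sorter.is_active():
        # Delegate every subtask whose dependencies are all complete.
        for name in sorter.get_ready():
            running[asyncio.create_task(worker(name))] = name
        done, _ = await asyncio.wait(
            set(running), return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            name = running.pop(task)
            task.result()  # re-raise any failure from the worker
            finished.append(name)  # a real manager would merge and test here
            sorter.done(name)  # unblocks subtasks that depended on this one
    return finished


async def demo_worker(name):
    # Placeholder for an agent working in its own workspace.
    await asyncio.sleep(0.01)
    return name


# Hypothetical plan: "models" and "io" both need "utils"; "cli" needs both.
PLAN = {
    "utils": set(),
    "models": {"utils"},
    "io": {"utils"},
    "cli": {"models", "io"},
}

order = asyncio.run(run_plan(PLAN, demo_worker))
print(order)
```

In this toy plan, `models` and `io` run at the same time once `utils` is done, while `cli` waits for both; the relative order of `models` and `io` in the result is not guaranteed.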
