CooperBench: Why Coding Agents Cannot be Your Teammates Yet

Abstract

Resolving team conflicts requires not only task-specific competence but also the social intelligence to find common ground and build consensus. As AI agents increasingly collaborate on complex work, they must develop coordination capabilities to function as effective teammates. Yet we hypothesize that current agents lack these capabilities. To test this, we introduce CooperBench, a benchmark of over 600 collaborative coding tasks across 12 libraries in 4 programming languages. Each task assigns two agents different features that can be implemented independently but may conflict without proper coordination. Tasks are grounded in real open-source repositories with expert-written tests. Evaluating state-of-the-art coding agents, we observe the curse of coordination: agents achieve on average 30% lower success rates when working together compared to performing both tasks individually. This contrasts sharply with human teams, where adding teammates typically improves productivity. Our analysis reveals three key issues: (1) communication channels become jammed with vague, ill-timed, and inaccurate messages; (2) even with effective communication, agents deviate from their commitments; and (3) agents often hold incorrect expectations about others' plans and communication. Through large-scale simulation, we also observe rare but interesting emergent coordination behavior, including role division, resource division, and negotiation. Our research presents a novel benchmark for collaborative coding and calls for a shift from pursuing individual agent capability to developing social intelligence.
