Decentralized Multi-Agent Systems with Shared Context

Abstract

Multi-agent systems (MAS) can scale large language model reasoning at test time by decomposing complex problems into parallel subtasks. However, most existing MAS rely on centralized orchestration, in which a main agent assigns work, collects outputs, and merges results. As the number of subtasks grows, this controller becomes a communication and integration bottleneck. We propose Decentralized Language Models (DeLM), a MAS framework that decentralizes coordination through parallel agents, a shared verified context, and a task queue. Agents asynchronously claim subtasks, read accumulated progress, perform local reasoning, and write back compact verified updates. The shared context acts as a common communication substrate, enabling agents to build on one another's verified progress without routing every update through a central controller. Empirically, DeLM improves both software-engineering test-time scaling and long-context reasoning. On SWE-bench Verified, DeLM achieves the best performance across Avg.@1, Pass@2, and Pass@4, with gains of up to 10.5 percentage points over the strongest baseline, while reducing cost per task by roughly 50%. On LongBench-v2 Multi-Doc QA, DeLM achieves the highest average accuracy across four frontier model families, improving over the strongest baseline by up to 5.7 percentage points. The code is available on our project website at https://yuzhenmao.github.io/DeLM/.
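
The coordination loop described above (claim a subtask, read accumulated progress, reason locally, write back a verified update) can be illustrated with a minimal sketch. This is not the DeLM implementation: the names `SharedContext`, `solve`, `verify`, `agent`, and `run_agents` are hypothetical, `solve` stands in for a language model call, and `verify` is a placeholder check. The sketch only shows that agents can coordinate through a queue and a shared log without a central controller.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class SharedContext:
    """Hypothetical append-only log of verified updates, visible to all agents."""
    updates: list[str] = field(default_factory=list)

    def read(self) -> list[str]:
        # Return a copy so callers cannot mutate the shared log directly.
        return list(self.updates)

    def write(self, update: str) -> None:
        self.updates.append(update)


def solve(subtask: int, progress: list[str]) -> str:
    # Placeholder for local reasoning; a real agent would call a language
    # model here and condition on `progress`.
    return f"subtask {subtask} -> {subtask * subtask}"


def verify(update: str) -> bool:
    # Placeholder verification step (assumption): accept any non-empty update.
    return bool(update)


async def agent(name: str, queue: asyncio.Queue, context: SharedContext) -> None:
    while True:
        try:
            subtask = queue.get_nowait()  # claim the next unclaimed subtask
        except asyncio.QueueEmpty:
            return  # no work left; this agent stops
        progress = context.read()  # read accumulated verified progress
        await asyncio.sleep(0)  # yield so other agents can interleave
        update = solve(subtask, progress)
        if verify(update):
            context.write(f"[{name}] {update}")  # write back a compact update


async def run_agents(num_agents: int, subtasks: list[int]) -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    for subtask in subtasks:
        queue.put_nowait(subtask)
    context = SharedContext()
    # No coordinator merges results: agents only touch the queue and the context.
    await asyncio.gather(
        *(agent(f"agent-{i}", queue, context) for i in range(num_agents))
    )
    return context.read()


if __name__ == "__main__":
    for line in asyncio.run(run_agents(3, [1, 2, 3, 4, 5])):
        print(line)
```

In this sketch the order of entries in the shared context depends on how the agents interleave, so downstream consumers should not assume that updates arrive in subtask order.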