Decentralized Multi-Agent Systems with Shared Context

Abstract

Multi-agent systems (MAS) can scale large language model reasoning at test time by decomposing complex problems into parallel subtasks. However, most existing MAS rely on centralized orchestration, in which a main agent assigns work, collects outputs, and merges results. As the number of subtasks grows, this controller becomes a communication and integration bottleneck. We propose Decentralized Language Models (DeLM), a MAS framework that decentralizes coordination through parallel agents, a shared verified context, and a task queue. Agents asynchronously claim subtasks, read the accumulated progress, perform local reasoning, and write back compact verified updates. The shared context acts as a common communication substrate, so agents can build on one another's verified progress without routing every update through a central controller. Empirically, DeLM improves both test-time scaling for software engineering and long-context reasoning. On SWE-bench Verified, DeLM achieves the best performance on all three metrics (Avg.@1, Pass@2, and Pass@4), with gains of up to 10.5 percentage points over the strongest baseline, while reducing cost per task by roughly 50%. On LongBench-v2 Multi-Doc QA, DeLM achieves the highest average accuracy across four frontier model families, improving over the strongest baseline by up to 5.7 percentage points. The code is available on our project website at https://yuzhenmao.github.io/DeLM/.
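
The coordination loop described above (claim a subtask, read accumulated progress, reason locally, write back a verified update) can be illustrated with a minimal sketch. This is not DeLM's implementation: the names `SharedContext`, `solve`, `verify`, and `run_agents` are hypothetical, threads stand in for language-model agents, and squaring an integer stands in for local reasoning. It only shows how a task queue and a shared, verified context can replace a central controller.

```python
import queue
import threading


class SharedContext:
    """Hypothetical shared context: a lock-guarded log of verified updates."""

    def __init__(self):
        self._updates = []
        self._lock = threading.Lock()

    def snapshot(self):
        # Return a copy so agents can read progress without holding the lock.
        with self._lock:
            return list(self._updates)

    def write(self, update, verifier):
        # Only updates that pass verification enter the shared context.
        if not verifier(update):
            return False
        with self._lock:
            self._updates.append(update)
        return True


def solve(subtask, progress):
    # Stand-in for local reasoning (assumption: a real agent would call a model
    # here and could condition on `progress`).
    return {"subtask": subtask, "result": subtask * subtask}


def verify(update):
    # Stand-in verifier (assumption: a real check might run tests or a judge).
    return update["result"] == update["subtask"] ** 2


def agent(tasks, context):
    while True:
        try:
            subtask = tasks.get_nowait()   # claim a subtask; no controller assigns it
        except queue.Empty:
            return                         # queue drained, agent stops
        progress = context.snapshot()      # read accumulated verified progress
        update = solve(subtask, progress)  # local reasoning
        context.write(update, verify)      # write back a compact verified update


def run_agents(subtasks, num_agents=4):
    tasks = queue.Queue()
    for subtask in subtasks:
        tasks.put(subtask)
    context = SharedContext()
    workers = [
        threading.Thread(target=agent, args=(tasks, context))
        for _ in range(num_agents)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return context.snapshot()
```

Calling `run_agents(range(6))` returns one verified update per subtask, in whatever order the agents happened to finish. No agent in this sketch collects or merges the others' outputs; the queue and the shared context are the only coordination points.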