Dark Forest: Multi-Agent LLMs That Speak Less and Are More Accurate

Abstract

Multi-agent LLM systems improve reasoning by combining outputs from multiple agents, but interaction-heavy methods can introduce error propagation and high communication overhead. When agents exchange raw responses or reasoning traces, incorrect intermediate reasoning may be adopted and amplified, leading to a confident but wrong consensus; multi-round communication also increases token consumption, latency, and inference cost. In this paper, we propose DarkForest, a coordination framework based on controlled communication. DarkForest first keeps agents independent, so each agent produces an answer without seeing the others' outputs. It then parses the raw responses into structured candidate records, groups semantically equivalent candidates into clusters, and estimates a calibrated belief distribution over these clusters using agent reliability, confidence, parse quality, support-pattern reliability, and independence corrections. A coordinator then receives only policy-permitted evidence from this belief state, which keeps communication controlled. Experiments on six reasoning benchmarks show that DarkForest achieves leading overall quality, improves on the strongest baseline by up to 30.7% on benchmark metrics, and reduces token consumption by up to 6.5× compared with communication-heavy baselines.
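
The parse, cluster, and belief-estimation steps described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Candidate` record, the `normalize` function (a string-level stand-in for semantic-equivalence grouping), the default reliability of 0.5, and the product-of-factors scoring rule are all assumptions made for illustration, and the sketch omits support-pattern reliability and independence corrections entirely.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical structured record parsed from one agent's raw response."""
    agent: str
    answer: str           # parsed final answer
    confidence: float     # self-reported confidence in [0, 1]
    parse_quality: float  # how cleanly the answer was extracted, in [0, 1]


def normalize(answer: str) -> str:
    # Assumed stand-in for semantic-equivalence grouping:
    # only case and whitespace differences are merged here.
    return " ".join(answer.lower().split())


def belief_over_clusters(candidates, reliability):
    """Return a normalized belief distribution over answer clusters.

    Assumed scoring rule: each candidate contributes
    reliability * confidence * parse_quality to its cluster.
    """
    weights = defaultdict(float)
    for c in candidates:
        r = reliability.get(c.agent, 0.5)  # assumed default for unknown agents
        weights[normalize(c.answer)] += r * c.confidence * c.parse_quality
    total = sum(weights.values())
    if total == 0:
        return {}
    return {cluster: w / total for cluster, w in weights.items()}


# Three agents answer independently; two agree up to whitespace.
candidates = [
    Candidate("a1", "42", 0.9, 1.0),
    Candidate("a2", " 42 ", 0.6, 1.0),
    Candidate("a3", "41", 0.8, 0.5),
]
reliability = {"a1": 0.8, "a2": 0.5, "a3": 0.5}
belief = belief_over_clusters(candidates, reliability)
```

In this toy setup, a coordinator would see only `belief` (and whatever evidence a policy permits), never the agents' raw reasoning traces, which is the sense in which communication stays controlled.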