ChatPaper.ai


Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning

January 14, 2026
Authors: Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park
cs.AI

Abstract

Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective, and efficient path to distribution-shift-robust multi-agent reasoning without tuning.
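The deliberation loop the abstract describes — a team of specialist agents discussing over multiple turns, each turn conditioned on retrieved textual experiences, with a final consensus decision — can be illustrated with a minimal toy sketch. Everything below (function names, the vote-based consensus, the stand-in "agents") is an illustrative assumption, not the paper's actual implementation:

```python
import collections
from typing import Callable, List

def run_deliberation(
    agents: List[Callable[[str, List[str]], str]],
    question: str,
    experience_pool: List[str],
    turns: int = 2,
) -> str:
    """Hypothetical MATTRL-style loop: multi-turn discussion over a shared
    transcript plus retrieved experiences, ending in a consensus vote."""
    transcript: List[str] = []
    for _ in range(turns):
        for agent in agents:
            # Each agent sees the question, the test-time experience pool,
            # and the discussion so far, then contributes a proposal.
            proposal = agent(question, experience_pool + transcript)
            transcript.append(proposal)
    # Consensus: majority vote over each agent's final-turn proposal.
    finals = transcript[-len(agents):]
    return collections.Counter(finals).most_common(1)[0][0]

# Toy "specialist" agents answering a trivial question; one is unreliable.
def expert_a(q, ctx): return "4"
def expert_b(q, ctx): return "4"
def expert_c(q, ctx): return "5"

answer = run_deliberation(
    [expert_a, expert_b, expert_c],
    "What is 2 + 2?",
    ["experience: double-check arithmetic before answering"],
)
print(answer)  # majority of final-turn proposals -> 4
```

In the real framework the agents would be LLM calls and the experience pool would hold credit-assigned, turn-level textual experiences; the sketch only shows the control flow of deliberation plus consensus.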
PDF · January 17, 2026