

Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning

January 14, 2026
Authors: Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park
cs.AI

Abstract

Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. To address this, we introduce Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, which is then reinjected into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect outcomes. MATTRL offers a stable, effective, and efficient path to distribution-shift-robust multi-agent reasoning without tuning.
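The abstract describes the inference loop only at a high level; the Python sketch below makes the described flow concrete (expert team, multi-turn discussion, experience retrieval and reinjection, consensus). Every name here (`ExperiencePool`, the `llm` callable signature, majority-vote consensus, store-everything credit assignment) is an illustrative assumption, not the authors' released code or their actual credit-assignment scheme, which the paper ablates separately.

```python
from collections import Counter
from typing import Callable, List

class ExperiencePool:
    """Turn-level pool of textual experiences (placeholder retrieval).

    In the paper, retrieval would be similarity-based; here we simply
    return the most recent entries to keep the sketch self-contained.
    """
    def __init__(self) -> None:
        self.entries: List[str] = []

    def retrieve(self, question: str, k: int = 3) -> List[str]:
        return self.entries[-k:]

    def add(self, experience: str) -> None:
        self.entries.append(experience)

def mattrl_deliberate(
    question: str,
    experts: List[str],             # role descriptions, e.g. "cardiologist"
    llm: Callable[[str], str],      # any text-in / text-out model call
    pool: ExperiencePool,
    n_turns: int = 2,
) -> str:
    """Multi-turn expert discussion with test-time experience injection,
    ending in a simple majority-vote consensus (an assumed mechanism)."""
    transcript: List[str] = []
    for turn in range(n_turns):
        experiences = pool.retrieve(question)
        for role in experts:
            prompt = (
                f"You are a {role}. Question: {question}\n"
                f"Relevant past experiences: {experiences}\n"
                f"Discussion so far: {transcript}\n"
                "Give your current answer and reasoning."
            )
            reply = llm(prompt)
            transcript.append(f"[{role}, turn {turn}] {reply}")
            # Turn-level credit assignment would score this reply and decide
            # whether to reinject it as experience; this sketch stores all.
            pool.add(transcript[-1])
    # Consensus step: take the most common final answer across experts.
    finals = [
        llm(f"Given the discussion {transcript}, state only the final answer.")
        for _ in experts
    ]
    return Counter(finals).most_common(1)[0][0]
```

Because the loop performs no gradient updates, adaptation comes entirely from the textual experiences accumulated and reinjected at inference time, which is what lets the method avoid MARL's non-stationarity and reward-variance issues.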