MarsRL: Advancing Multi-Agent Reasoning System via Reinforcement Learning with Agentic Pipeline Parallelism
November 14, 2025
Authors: Shulin Liu, Dong Du, Tao Yang, Yang Li, Boyu Qiu
cs.AI
Abstract
Recent progress in large language models (LLMs) has been propelled by reinforcement learning with verifiable rewards (RLVR) and test-time scaling. However, the limited output length of LLMs constrains the depth of reasoning attainable in a single inference pass. Multi-agent reasoning systems offer a promising alternative by employing multiple agents, including a Solver, a Verifier, and a Corrector, to iteratively refine solutions. While effective with closed-source models such as Gemini 2.5 Pro, these systems struggle to generalize to open-source models due to their insufficient critique and correction capabilities. To address this, we propose MarsRL, a novel reinforcement learning framework with agentic pipeline parallelism, designed to jointly optimize all agents in the system. MarsRL introduces agent-specific reward mechanisms to mitigate reward noise and employs pipeline-inspired training to improve efficiency on long trajectories. Applied to Qwen3-30B-A3B-Thinking-2507, MarsRL improves AIME2025 accuracy from 86.5% to 93.3% and BeyondAIME accuracy from 64.9% to 73.8%, even surpassing Qwen3-235B-A22B-Thinking-2507. These findings highlight the potential of MarsRL to advance multi-agent reasoning systems and broaden their applicability across diverse reasoning tasks.
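To make the Solver/Verifier/Corrector loop concrete, here is a minimal sketch of the iterative refinement the abstract describes. The agent interfaces (`solve`, `verify`, `correct`), the acceptance-based stopping rule, and `max_rounds` are illustrative assumptions, not the paper's actual API.

```python
from typing import Callable, Tuple

def multi_agent_refine(
    problem: str,
    solve: Callable[[str], str],                 # Solver agent (assumed interface)
    verify: Callable[[str, str], Tuple[str, bool]],  # Verifier: (critique, accepted)
    correct: Callable[[str, str, str], str],     # Corrector agent (assumed interface)
    max_rounds: int = 4,
) -> str:
    """Iteratively refine a solution with Solver, Verifier, and Corrector agents."""
    solution = solve(problem)                    # Solver drafts an initial answer
    for _ in range(max_rounds):
        critique, accepted = verify(problem, solution)  # Verifier critiques the draft
        if accepted:                             # stop once the Verifier accepts
            break
        solution = correct(problem, solution, critique)  # Corrector revises the draft
    return solution
```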
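The abstract also mentions agent-specific rewards that mitigate reward noise. The sketch below illustrates one plausible reading: each agent is scored against the part of the trajectory it controls, so a Verifier that correctly flags a wrong draft still earns reward even when the overall trajectory fails. The exact reward rules are assumptions for illustration; MarsRL's actual shaping is defined in the paper, not here.

```python
def agent_specific_rewards(
    draft_correct: bool,      # ground-truth check of the Solver's draft (assumed signal)
    verifier_verdict: bool,   # Verifier's claim that the draft was correct
    final_correct: bool,      # ground-truth check of the final answer
) -> dict:
    """Assign each agent its own verifiable reward (illustrative rules only)."""
    return {
        # Solver is scored on its own draft, independent of later edits.
        "solver": 1.0 if draft_correct else 0.0,
        # Verifier is scored on whether its verdict matches ground truth,
        # decoupling its reward from the Solver's mistakes (less noise).
        "verifier": 1.0 if verifier_verdict == draft_correct else 0.0,
        # Corrector is scored on whether the final answer ends up correct.
        "corrector": 1.0 if final_correct else 0.0,
    }
```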