MarsRL: Advancing Multi-Agent Reasoning System via Reinforcement Learning with Agentic Pipeline Parallelism
November 14, 2025
Authors: Shulin Liu, Dong Du, Tao Yang, Yang Li, Boyu Qiu
cs.AI
Abstract
Recent progress in large language models (LLMs) has been propelled by reinforcement learning with verifiable rewards (RLVR) and test-time scaling. However, the limited output length of LLMs constrains the depth of reasoning attainable in a single inference pass. Multi-agent reasoning systems offer a promising alternative: multiple agents, including a Solver, a Verifier, and a Corrector, iteratively refine solutions. While this approach is effective with closed-source models such as Gemini 2.5 Pro, it struggles to generalize to open-source models, whose critique and correction capabilities are insufficient. To address this, we propose MarsRL, a novel reinforcement learning framework with agentic pipeline parallelism, designed to jointly optimize all agents in the system. MarsRL introduces agent-specific reward mechanisms to mitigate reward noise and employs pipeline-inspired training to handle long trajectories efficiently. Applied to Qwen3-30B-A3B-Thinking-2507, MarsRL improves AIME2025 accuracy from 86.5% to 93.3% and BeyondAIME accuracy from 64.9% to 73.8%, even surpassing Qwen3-235B-A22B-Thinking-2507. These findings highlight MarsRL's potential to advance multi-agent reasoning systems and broaden their applicability across diverse reasoning tasks.
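To make the Solver–Verifier–Corrector loop concrete, the following is a minimal sketch of the iterative refinement pattern the abstract describes. All names here (`solve`, `verify`, `correct`, `Verdict`, `max_rounds`) are hypothetical illustrations, not the paper's actual agent interfaces or the MarsRL training code.

```python
# Hypothetical sketch of the Solver -> Verifier -> Corrector refinement
# loop described in the abstract. The agent callables and the Verdict
# structure are illustrative assumptions, not the paper's API.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool   # Verifier's judgment of the current solution
    critique: str  # feedback that the Corrector conditions on


def multi_agent_refine(
    problem: str,
    solve: Callable[[str], str],            # Solver agent
    verify: Callable[[str, str], Verdict],  # Verifier agent
    correct: Callable[[str, str, str], str],  # Corrector agent
    max_rounds: int = 3,  # assumed iteration budget
) -> str:
    """Iteratively refine a solution until the Verifier accepts it."""
    solution = solve(problem)
    for _ in range(max_rounds):
        verdict = verify(problem, solution)
        if verdict.passed:
            break
        # The Corrector revises the solution using the Verifier's
        # critique, spreading reasoning over multiple rounds rather
        # than relying on a single long inference pass.
        solution = correct(problem, solution, verdict.critique)
    return solution
```

Under this reading, each round produces a trajectory segment per agent, which is what MarsRL's agent-specific rewards and pipeline-inspired training would operate over; the abstract does not specify those mechanisms in detail, so they are not sketched here.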