MASPRM: Multi-Agent System Process Reward Model
October 28, 2025
Authors: Milad Yazdani, Mahdi Mostajabdaveh, Zirui Zhou, Ying Xiong
cs.AI
Abstract
Practical deployment of Multi-Agent Systems (MAS) demands strong test-time
performance, motivating methods that guide inference-time search and
selectively spend compute to improve quality. We present the Multi-Agent System
Process Reward Model (MASPRM), which assigns per-action, per-agent values to
partial inter-agent transcripts and acts as an inference-time controller.
MASPRM is trained from multi-agent Monte Carlo Tree Search (MCTS) rollouts by
propagating returns to local, per-step targets, so it requires no step-level
human annotations. At inference, MASPRM guides step-level beam search and MCTS, focusing
computation on promising branches and pruning early. On GSM8K and MATH,
MASPRM-guided decoding with an outcome reward model (ORM) applied to the final
answer improves exact match (EM) over a single straight-through MAS pass by
+30.7 and +22.9 points, respectively. A MASPRM trained on GSM8K transfers
zero-shot to MATH without retraining, adding 8.4 EM points at the same compute
budget. MASPRM is a plug-in value model that estimates per-agent progress and
complements verifier-style decoders, enabling more reliable, compute-aware
multi-agent reasoning. Code: https://github.com/milad1378yz/MASPRM
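
To make the training recipe concrete, the sketch below shows one plausible reading of "propagating returns to local targets": every visited transcript prefix receives a Monte Carlo value target equal to the mean final return of the MCTS rollouts that pass through it, and MASPRM would then be regressed onto these targets. This is an assumption-laden illustration; the rollout structure and the function `local_value_targets` are hypothetical, not the paper's actual API.

```python
# Hypothetical sketch (not the paper's code): derive per-prefix value targets
# from multi-agent MCTS rollouts by propagating each rollout's final return
# to every intermediate (agent, action) prefix it visits.
from collections import defaultdict
from statistics import mean

def local_value_targets(rollouts):
    """rollouts: list of (steps, final_return) pairs, where steps is a
    sequence of (agent, action) tuples and final_return is e.g. 0/1 EM.

    Returns a dict mapping each visited transcript prefix to the mean
    return of the rollouts passing through it: a Monte Carlo value
    estimate usable as a regression target, with no human step labels.
    """
    returns_at = defaultdict(list)
    for steps, final_return in rollouts:
        prefix = ()
        for agent, action in steps:
            prefix += ((agent, action),)
            returns_at[prefix].append(final_return)
    return {prefix: mean(rs) for prefix, rs in returns_at.items()}
```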
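Similarly, a minimal sketch of the inference-time loop, assuming a MAS generator that proposes candidate next agent turns: MASPRM scores each partial transcript so low-value branches are pruned early, and the ORM reranks the surviving final answers. `propose_next_steps`, `masprm_value`, and `orm_score` are stand-in names for components the abstract describes but does not specify.

```python
# Hypothetical sketch of MASPRM-guided step-level beam search. The three
# callables are assumed interfaces: propose_next_steps(question, transcript, k)
# yields candidate next agent turns (dicts), masprm_value scores a partial
# transcript, and orm_score scores a completed transcript's final answer.

def guided_beam_search(question, propose_next_steps, masprm_value, orm_score,
                       beam_width=4, expansions=4, max_turns=8):
    beams = [[]]  # each beam is a partial inter-agent transcript
    for _ in range(max_turns):
        candidates = []
        for transcript in beams:
            for step in propose_next_steps(question, transcript, k=expansions):
                candidates.append(transcript + [step])
        if not candidates:
            break
        # MASPRM acts as the controller: keep only the highest-value
        # partial transcripts, pruning weak branches early.
        candidates.sort(key=lambda t: masprm_value(question, t), reverse=True)
        beams = candidates[:beam_width]
        if all(t[-1].get("is_final", False) for t in beams):
            break
    # Rerank finished beams with the ORM applied to the final answer.
    return max(beams, key=lambda t: orm_score(question, t))
```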