MASPRM: Multi-Agent System Process Reward Model

Abstract

Practical deployment of Multi-Agent Systems (MAS) demands strong test-time performance, motivating methods that guide inference-time search and selectively spend compute to improve quality. We present the Multi-Agent System Process Reward Model (MASPRM), which assigns per-action, per-agent values to partial inter-agent transcripts and acts as an inference-time controller. MASPRM is trained from multi-agent Monte Carlo Tree Search (MCTS) rollouts by propagating returns to local targets, so it requires no step-level human annotations. At inference, MASPRM guides step-level beam search and MCTS, focusing computation on promising branches and pruning the rest early. On GSM8K and MATH, MASPRM-guided decoding with an outcome reward model (ORM) applied to the final answer improves exact match (EM) over a single straight-through MAS pass by +30.7 and +22.9 points, respectively. A MASPRM trained on GSM8K transfers zero-shot to MATH without retraining, adding 8.4 EM points at the same compute budget. MASPRM is a plug-in value model that estimates per-agent progress and complements verifier-style decoders, enabling more reliable, compute-aware multi-agent reasoning. Code: https://github.com/milad1378yz/MASPRM
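The two mechanisms named in the abstract, propagating rollout returns to local targets and using the resulting values to steer step-level beam search, can be illustrated with a toy sketch. This is not the authors' implementation: the function names `backup_targets`, `beam_search`, `propose`, and `value` are hypothetical, the rollouts are hand-written rather than produced by MCTS, a lookup table stands in for the learned MASPRM, and plain averaging of terminal returns over transcript prefixes is assumed as the simplest form of return propagation.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# A partial inter-agent transcript: one entry per agent step (illustrative type).
Transcript = Tuple[str, ...]


def backup_targets(rollouts: Sequence[Tuple[Transcript, float]]) -> Dict[Transcript, float]:
    """Assign each transcript prefix the mean terminal return of the rollouts through it."""
    total: Dict[Transcript, float] = defaultdict(float)
    count: Dict[Transcript, int] = defaultdict(int)
    for transcript, ret in rollouts:
        for k in range(1, len(transcript) + 1):
            prefix = transcript[:k]
            total[prefix] += ret
            count[prefix] += 1
    return {prefix: total[prefix] / count[prefix] for prefix in total}


def beam_search(
    propose: Callable[[Transcript], List[str]],
    value: Callable[[Transcript], float],
    num_steps: int,
    beam_width: int,
) -> Transcript:
    """Expand every kept transcript by one agent step, then keep the top-valued ones."""
    beam: List[Transcript] = [()]
    for _ in range(num_steps):
        candidates = [t + (action,) for t in beam for action in propose(t)]
        if not candidates:
            break
        candidates.sort(key=value, reverse=True)
        beam = candidates[:beam_width]  # low-value branches are pruned here
    return beam[0]


# Hand-written toy rollouts: (full transcript, terminal return such as exact match).
rollouts = [
    (("plan", "solve", "check"), 1.0),
    (("plan", "solve", "skip"), 0.0),
    (("guess", "solve", "check"), 0.0),
]
targets = backup_targets(rollouts)


def value(transcript: Transcript) -> float:
    # Stand-in for a learned value model: look up the backed-up target, default 0.
    return targets.get(transcript, 0.0)


def propose(transcript: Transcript) -> List[str]:
    # Stand-in for the agents: a fixed menu of actions per step.
    menu = [["plan", "guess"], ["solve"], ["check", "skip"]]
    return menu[len(transcript)] if len(transcript) < len(menu) else []


best = beam_search(propose, value, num_steps=3, beam_width=2)
print(best)
```

In this toy setup the prefix `("plan",)` receives a target of 0.5 because one of its two rollouts succeeds, and the search ends on the transcript whose backed-up value is highest. A real system would replace the lookup table with a trained model that generalizes to unseen transcripts.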
