SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning

Abstract

Evaluating the step-by-step reliability of large language model (LLM) reasoning, such as Chain-of-Thought, remains challenging because high-quality step-level supervision is difficult and costly to obtain. In this paper, we introduce Self-Play Critic (SPC), a novel approach in which a critic model evolves its ability to assess reasoning steps through adversarial self-play games, eliminating the need for manual step-level annotation. SPC fine-tunes two copies of a base model to play two roles: a "sneaky generator" that deliberately produces erroneous steps designed to be difficult to detect, and a "critic" that analyzes the correctness of reasoning steps. The two models engage in an adversarial game in which the generator aims to fool the critic, while the critic seeks to identify the generator's errors. The models improve iteratively through reinforcement learning based on the game outcomes: the winner of each confrontation receives a positive reward and the loser receives a negative reward, driving continuous self-evolution. Experiments on three reasoning process benchmarks (ProcessBench, PRM800K, DeltaBench) demonstrate that SPC progressively enhances its error detection capabilities (e.g., accuracy increases from 70.8% to 77.7% on ProcessBench) and surpasses strong baselines, including a distilled R1 model. Furthermore, applying SPC to guide the test-time search of diverse LLMs significantly improves their mathematical reasoning performance on MATH500 and AIME2024, outperforming state-of-the-art process reward models.
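As a rough illustration of the reward rule described in the abstract, the following Python sketch assigns rewards for a single round of the adversarial game. The abstract specifies only the signs of the rewards. The magnitudes of +1 and -1, the names `RoundRewards` and `assign_rewards`, and the reduction of the critic's verdict to a single boolean are assumptions made here for illustration, not details of the SPC implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoundRewards:
    """Rewards handed to each player after one confrontation (hypothetical type)."""
    generator: float
    critic: float


def assign_rewards(critic_flagged_error: bool,
                   win: float = 1.0,
                   loss: float = -1.0) -> RoundRewards:
    """Zero-sum reward rule for one round, assuming the sneaky generator
    has injected an erroneous step and the critic has judged that step.

    The reward magnitudes are assumed; the source only states that the
    winner is rewarded positively and the loser negatively.
    """
    if critic_flagged_error:
        # The critic caught the injected error, so the critic wins.
        return RoundRewards(generator=loss, critic=win)
    # The critic accepted the erroneous step, so the generator wins.
    return RoundRewards(generator=win, critic=loss)


if __name__ == "__main__":
    print(assign_rewards(critic_flagged_error=True))
    print(assign_rewards(critic_flagged_error=False))
```

In a training loop, each player's reward would feed into its own reinforcement learning update, so that the two copies of the base model improve against each other over successive rounds.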
