SPARK: Synergistic Policy And Reward Co-Evolving Framework

Abstract

Recent Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) increasingly use Reinforcement Learning (RL) for post-pretraining: RL with Verifiable Rewards (RLVR) for objective tasks and RL from Human Feedback (RLHF) for subjective tasks. RLHF relies on human preferences, which makes it costly and can leave the reward model mismatched with the policy. RLVR discards its rollouts and correctness signals after each update, which wastes supervision. To address these challenges, we introduce the Synergistic Policy And Reward Co-Evolving Framework (SPARK), an efficient, on-policy, and stable method that builds on RLVR. Instead of discarding rollouts and correctness data, SPARK recycles this information to simultaneously train the model itself as a generative reward model. This auxiliary training mixes several objectives, such as pointwise reward scoring, pairwise comparison, and evaluation conditioned on further-reflection responses, to teach the model to evaluate and improve its own responses. The process eliminates the need for a separate reward model and for costly human preference data. SPARK creates a positive co-evolving feedback loop: improved reward accuracy yields better policy gradients, which in turn produce higher-quality rollouts that further refine the reward model. The unified framework also supports test-time scaling via self-reflection, without external reward models and their associated costs. SPARK achieves significant performance gains across multiple LLMs and LVLMs on reasoning, reward-model, and general benchmarks. For example, SPARK-VL-7B improves over the baselines by an average of 9.7% on 7 reasoning benchmarks, 12.1% on 2 reward benchmarks, and 1.5% on 8 general benchmarks, demonstrating robustness and broad generalization.
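The recycling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Every name here (`Rollout`, `pointwise_examples`, `pairwise_examples`, `reflection_examples`, `recycle`), the prompt templates, and the target formats are hypothetical. In particular, the reflection objective is only one possible reading of "evaluation conditioned on further-reflection responses". The sketch assumes each rollout already carries a boolean correctness signal from the RLVR verifier and shows how those rollouts could be turned into reward-model training examples instead of being discarded.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Rollout:
    """One sampled response plus the verifier's correctness signal (hypothetical type)."""
    response: str
    correct: bool


def pointwise_examples(prompt, rollouts):
    """Judge a single response; the target is the verifier's label."""
    return [
        {
            "task": "pointwise",
            "input": f"Question: {prompt}\nResponse: {r.response}\nIs the response correct?",
            "target": "yes" if r.correct else "no",
        }
        for r in rollouts
    ]


def pairwise_examples(prompt, rollouts):
    """Pair each correct rollout with each incorrect one; the target names the better response."""
    good = [r for r in rollouts if r.correct]
    bad = [r for r in rollouts if not r.correct]
    examples = []
    for i, (g, b) in enumerate(product(good, bad)):
        # Alternate the presentation order so the target is not always "A".
        first, second, target = (g, b, "A") if i % 2 == 0 else (b, g, "B")
        examples.append({
            "task": "pairwise",
            "input": (f"Question: {prompt}\nResponse A: {first.response}\n"
                      f"Response B: {second.response}\nWhich response is better?"),
            "target": target,
        })
    return examples


def reflection_examples(prompt, rollouts):
    """Revise an incorrect response; a verified-correct rollout serves as the target (assumption)."""
    good = [r for r in rollouts if r.correct]
    if not good:
        return []  # no verified answer is available to use as a target
    return [
        {
            "task": "reflection",
            "input": (f"Question: {prompt}\nPrevious response: {r.response}\n"
                      "The previous response is incorrect. Reflect and answer again."),
            "target": good[0].response,
        }
        for r in rollouts
        if not r.correct
    ]


def recycle(prompt, rollouts):
    """Turn one prompt's rollouts into a mixed batch of reward-model training examples."""
    return (pointwise_examples(prompt, rollouts)
            + pairwise_examples(prompt, rollouts)
            + reflection_examples(prompt, rollouts))
```

Under these assumptions, a prompt whose rollouts are all correct or all incorrect still yields pointwise examples but no pairwise or reflection examples, because there is no contrast to learn from.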
