Pre-Trained Policy Discriminators are General Reward Models

Abstract

We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal that guides the training policy towards a target policy with desired behaviors. Based on this insight, we propose Policy Discriminative Learning (POLAR), a scalable pre-training method that trains a reward model (RM) to recognize identical policies and discriminate between different ones. Unlike traditional reward modeling methods that rely on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods. For instance, POLAR-7B improves preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. POLAR also generalizes robustly in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance: across 20 benchmarks, it improves LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47%. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The strong performance, generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.
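The scaling claim in the abstract (a power law between computation and performance, with linear correlation coefficients near 0.99) can be illustrated with a minimal sketch. A power law becomes a straight line in log-log space, so a linear fit and a Pearson correlation on the logged values are a natural check. The sketch assumes the performance metric is a loss that decreases with compute, which the abstract does not specify; the function name `fit_power_law`, the data, and every constant below are synthetic and hypothetical, not results from the paper.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss ≈ a * compute**(-b) by least squares in log-log space.

    Returns (a, b, r), where r is the Pearson correlation between
    log(compute) and log(loss).
    """
    log_c = np.log(np.asarray(compute, dtype=float))
    log_l = np.log(np.asarray(loss, dtype=float))
    # A power law is linear in log-log space: log L = log a - b * log C.
    slope, intercept = np.polyfit(log_c, log_l, 1)
    r = np.corrcoef(log_c, log_l)[0, 1]
    return np.exp(intercept), -slope, r

# Synthetic illustration only: these numbers are invented for the sketch.
rng = np.random.default_rng(0)
compute = np.logspace(18, 21, 8)           # hypothetical compute budgets (FLOPs)
noise = np.exp(rng.normal(0.0, 0.01, 8))   # small multiplicative noise
loss = 50.0 * compute ** -0.1 * noise      # assumed ground-truth exponent b = 0.1

a, b, r = fit_power_law(compute, loss)
print(f"a = {a:.3g}, b = {b:.3f}, r = {r:.4f}")
```

Under this assumption the correlation is negative, because the loss falls as compute grows; a coefficient "approaching 0.99" then refers to its magnitude. With an accuracy-style metric that rises with compute, the sign would flip.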
