

Pre-Trained Policy Discriminators are General Reward Models

July 7, 2025
Authors: Shihan Dou, Shichun Liu, Yuming Yang, Yicheng Zou, Yunhua Zhou, Shuhao Xing, Chenhao Huang, Qiming Ge, Demin Song, Haijun Lv, Songyang Gao, Chengqi Lv, Enyu Zhou, Honglin Guo, Zhiheng Xi, Wenwei Zhang, Qipeng Guo, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Tao Gui, Kai Chen
cs.AI

Abstract

We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, POLAR-7B could improve preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance--improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.
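To make the pre-training idea concrete, here is a minimal sketch (not the authors' released code) of what a policy-discriminative objective can look like: for the same prompt, a candidate trajectory drawn from the same policy as a reference trajectory should be scored above a candidate drawn from a different policy, via a Bradley-Terry style pairwise loss. The `reward_model` interface, batch fields, and shapes below are illustrative assumptions.

```python
import torch.nn.functional as F

def policy_discriminative_loss(reward_model, prompts, ref_traj,
                               same_policy_traj, diff_policy_traj):
    """Hypothetical sketch of a POLAR-style pre-training step.

    The reward model scores a candidate trajectory *relative to* a reference
    trajectory sampled from the target policy:
      - same_policy_traj: sampled from the same policy as ref_traj (positive)
      - diff_policy_traj: sampled from a different policy          (negative)
    The pairwise loss pushes r(positive) above r(negative), so the RM learns
    to measure how close a candidate's generating policy is to the reference.
    """
    r_pos = reward_model(prompts, ref_traj, same_policy_traj)  # shape: (batch,)
    r_neg = reward_model(prompts, ref_traj, diff_policy_traj)  # shape: (batch,)
    # -log sigmoid(r_pos - r_neg): standard Bradley-Terry pairwise preference loss
    return -F.logsigmoid(r_pos - r_neg).mean()
```

At fine-tuning or RFT time, the same relative scoring can be reused by treating a demonstration or reference answer as the target-policy trajectory, so the RM rewards outputs that look as if they were produced by the desired policy; this is only an interpretation of the abstract, not a description of the released implementation.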