
Pre-Trained Policy Discriminators are General Reward Models

July 7, 2025
作者: Shihan Dou, Shichun Liu, Yuming Yang, Yicheng Zou, Yunhua Zhou, Shuhao Xing, Chenhao Huang, Qiming Ge, Demin Song, Haijun Lv, Songyang Gao, Chengqi Lv, Enyu Zhou, Honglin Guo, Zhiheng Xi, Wenwei Zhang, Qipeng Guo, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Tao Gui, Kai Chen
cs.AI

Abstract

We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, POLAR-7B could improve preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance--improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.
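The abstract describes the pre-training objective only at a high level: the reward model (RM) is trained to score responses drawn from the same policy as a reference response higher than responses drawn from a different policy. The sketch below illustrates one plausible way to instantiate such a policy-discriminative objective with a pairwise Bradley-Terry-style loss. This is not the authors' implementation; the loss form and all names (`score_fn`, the batch fields) are assumptions made for illustration.

```python
# Minimal sketch (not the paper's actual code) of a POLAR-style
# policy-discriminative pre-training loss. Assumption: the RM exposes a
# scoring function score_fn(prompts, references, candidates) -> (B,) scores,
# where references are sampled from a target policy, and candidates come
# either from the same policy (positives) or a different policy (negatives).
import torch.nn.functional as F

def polar_style_loss(score_fn, prompts, ref_responses,
                     same_policy_responses, diff_policy_responses):
    """Pairwise logistic (Bradley-Terry-style) loss: the RM should assign a
    higher score to a candidate sampled from the same policy as the reference
    than to a candidate sampled from a different policy."""
    s_pos = score_fn(prompts, ref_responses, same_policy_responses)  # (B,)
    s_neg = score_fn(prompts, ref_responses, diff_policy_responses)  # (B,)
    # Maximize the margin s_pos - s_neg via -log sigmoid(s_pos - s_neg).
    return -F.logsigmoid(s_pos - s_neg).mean()
```

Under this reading, pre-training only requires pairs of trajectories labeled by whether they were generated by the same policy, which is the kind of scalable, relative (rather than absolute-preference) supervision the abstract emphasizes.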