Pre-Trained Policy Discriminators are General Reward Models

Abstract

We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal that guides the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods that rely on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods. For instance, compared to SOTA baselines, POLAR-7B improves preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance: averaged over 20 benchmarks, it improves LLaMa3.1-8B from 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47%. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.
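
The scaling claim above pairs a power law with a linear correlation coefficient. The two fit together because a power law y = a * x^b becomes a straight line after taking logarithms of both axes. The sketch below illustrates that check under stated assumptions: the abstract does not say which quantities were correlated, so the `compute` and `loss` values here are synthetic placeholders, and `fit_power_law` is a hypothetical helper, not code from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**b, done in log-log space.

    Returns (a, b, r), where r is the Pearson correlation of the logged values.
    """
    log_x = [math.log(c) for c in compute]
    log_y = [math.log(v) for v in loss]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the power-law exponent.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y)) / sum(
        (x - mean_x) ** 2 for x in log_x
    )
    # Intercept of the line is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b, pearson(log_x, log_y)


# Synthetic data (an assumption for illustration, not results from the paper):
# an exact power law with exponent -0.1.
compute = [1e18, 1e19, 1e20, 1e21]
loss = [0.5 * (c / 1e18) ** -0.1 for c in compute]

a, b, r = fit_power_law(compute, loss)
print(f"exponent b = {b:.3f}, correlation r = {r:.3f}")
```

On noise-free data like this the logged points fall exactly on a line, so the correlation has magnitude 1 (negative here, because loss falls as compute grows). Real measurements scatter around the line, which is why a reported coefficient near 0.99 in magnitude is read as evidence of a clean power-law trend.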
