Dataset Reset Policy Optimization for RLHF

Abstract

Reinforcement Learning from Human Preference-based feedback (RLHF) is a popular paradigm for fine-tuning generative models and has produced impressive models such as GPT-4 and Claude3 Opus. The framework often consists of two steps: learning a reward model from an offline preference dataset, then running online RL to optimize the learned reward model. In this work, leveraging the idea of reset, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that the offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to states in the offline dataset instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy that is covered by the offline dataset, under general function approximation with finite sample complexity. In experiments, we demonstrate that on both the TL;DR summarization dataset and the Anthropic Helpful and Harmless (HH) dataset, generations from DR-PO are better than those from Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) under the metric of GPT-4 win-rate. Code for this work can be found at https://github.com/Cornell-RL/drpo.
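
The dataset reset idea can be illustrated with a minimal sketch. It is not taken from the DR-PO codebase. It assumes that, for text generation, a "state" is a prompt followed by a partial response, so resetting to a dataset state means starting generation from a prefix of a labeler-preferred response. The function name `sample_start_state`, the `reset_prob` mixing parameter, and the uniform choice of cut point are all hypothetical choices made for illustration.

```python
import random
from typing import List, Sequence


def sample_start_state(
    prompt: Sequence[str],
    preferred_response: Sequence[str],
    reset_prob: float,
    rng: random.Random,
) -> List[str]:
    """Pick the state from which the online policy starts generating.

    With probability `reset_prob` (a hypothetical mixing parameter), reset to
    a state from the offline dataset: the prompt plus a random prefix of the
    preferred response. Otherwise start from the initial state, i.e. the
    prompt alone, as standard online RL would.
    """
    if preferred_response and rng.random() < reset_prob:
        # Assumption: the cut point is uniform over proper prefixes
        # (length 0 to len - 1), so the policy always has tokens left to generate.
        cut = rng.randrange(len(preferred_response))
        return list(prompt) + list(preferred_response[:cut])
    return list(prompt)


# Usage: tokens are plain strings here purely for readability.
rng = random.Random(0)
prompt = ["Summarize", ":", "the", "post"]
preferred = ["A", "short", "summary", "."]
start_state = sample_start_state(prompt, preferred, reset_prob=0.5, rng=rng)
```

Under these assumptions, the policy would continue generating from `start_state`, and the learned reward model would score the completed response, so the optimizer sees rollouts that begin in states the labelers preferred as well as rollouts from the initial state distribution.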
