Contrastive Preference Learning: Learning from Human Feedback without RL

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular paradigm for aligning models with human intent. Typically, RLHF algorithms operate in two phases: first, they use human preferences to learn a reward function, and second, they align the model by optimizing the learned reward via reinforcement learning (RL). This paradigm assumes that human preferences are distributed according to reward, but recent work suggests that they instead follow the regret under the user's optimal policy. Thus, learning a reward function from feedback is not only based on a flawed assumption about human preferences, but also leads to unwieldy optimization challenges that stem from policy gradients or bootstrapping in the RL phase. Because of these optimization challenges, contemporary RLHF methods restrict themselves to contextual bandit settings (e.g., large language models) or limit observation dimensionality (e.g., state-based robotics). We overcome these limitations by introducing a new family of algorithms for optimizing behavior from human feedback using the regret-based model of human preferences. Using the principle of maximum entropy, we derive Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions, circumventing the need for RL. CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs. This enables CPL to scale elegantly to high-dimensional and sequential RLHF problems while being simpler than prior methods.
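
The abstract does not spell out the contrastive objective, so the following is only a rough sketch of what such a loss might look like. It assumes that the regret-based preference model is logistic in the discounted sum of optimal advantages over a segment, and that under maximum entropy the optimal advantage equals alpha * log pi(a|s). The names `segment_score` and `cpl_loss`, and the default values of `alpha` and `gamma`, are hypothetical and chosen for illustration.

```python
import math


def segment_score(log_probs, alpha, gamma):
    """Discounted sum of alpha * log pi(a_t | s_t) over one behavior segment.

    Assumption: under maximum entropy, alpha * log pi stands in for the
    optimal advantage, so no reward function is needed.
    """
    return alpha * sum((gamma ** t) * lp for t, lp in enumerate(log_probs))


def cpl_loss(pairs, alpha=0.1, gamma=1.0):
    """Mean contrastive loss over preference pairs.

    pairs: list of (logp_preferred, logp_rejected), where each element is a
    list of per-step log-probabilities that the current policy assigns to
    the actions taken in that segment.
    """
    total = 0.0
    for logp_pos, logp_neg in pairs:
        diff = segment_score(logp_neg, alpha, gamma) - segment_score(logp_pos, alpha, gamma)
        # -log sigmoid(pos - neg) == log(1 + exp(neg - pos)), computed stably
        total += max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total / len(pairs)


# Hypothetical usage: one pair where the policy already favors the preferred segment.
preferred = [math.log(0.9), math.log(0.8)]
rejected = [math.log(0.2), math.log(0.1)]
loss = cpl_loss([(preferred, rejected)])
```

In this sketch the loss depends only on log-probabilities of logged actions, which is what makes it off-policy: no rollouts, value bootstrapping, or policy gradients are involved, and it reduces to ordinary supervised training on preference pairs.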
