Nash Learning from Human Feedback

Abstract

Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF begins by learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. The LLM's policy is then fine-tuned with a reinforcement learning algorithm to maximize the reward model. However, current reward models have two inherent limitations: they cannot fully represent the richness of human preferences, and they depend on the sampling distribution. In this study, we introduce an alternative pipeline for fine-tuning LLMs using pairwise human feedback. Our approach first learns a preference model, which is conditioned on two inputs given a prompt, and then pursues a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iterate converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. To demonstrate the effectiveness of our approach, we present experimental results involving the fine-tuning of an LLM for a text summarization task. We believe NLHF offers a compelling avenue for preference learning and policy optimization with the potential to advance the field of aligning LLMs with human preferences.
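
As a rough illustration of the tabular setting described above, the sketch below runs a mirror-descent style iteration on a three-response toy problem. The abstract does not state the Nash-MD update rule, so the form used here is an assumption: each step first mixes the current policy geometrically with a reference policy, then applies an exponentiated update using each response's win probability against that mixture. The preference matrix `PREF`, the function names, and the values of `eta`, `tau`, and `steps` are all hypothetical. The preferences are deliberately cyclic (response 0 beats 1, 1 beats 2, 2 beats 0), which no scalar reward can rank consistently, yet a regularized Nash equilibrium still exists.

```python
import math


def softmax(logits):
    """Normalize log-weights into a probability vector (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def win_prob(pref, opponent):
    """For each response i, the probability that it is preferred over a draw from `opponent`."""
    return [sum(p_ij * q_j for p_ij, q_j in zip(row, opponent)) for row in pref]


def nash_md_step(policy, reference, pref, eta, tau):
    """One assumed mirror-descent step toward the regularized equilibrium."""
    beta = eta * tau
    # Geometric mixture of the current policy and the reference policy.
    mixture = softmax([
        (1.0 - beta) * math.log(p) + beta * math.log(m)
        for p, m in zip(policy, reference)
    ])
    # Exponentiated update: reward each response by its win rate against the mixture.
    gains = win_prob(pref, mixture)
    return softmax([math.log(q) + eta * g for q, g in zip(mixture, gains)])


def run_nash_md(pref, reference, initial, eta=0.1, tau=0.5, steps=500):
    """Iterate the step above and return the last policy."""
    policy = list(initial)
    for _ in range(steps):
        policy = nash_md_step(policy, reference, pref, eta, tau)
    return policy


# Hypothetical pairwise preferences: PREF[i][j] = P(response i preferred over response j).
# The cycle 0 > 1 > 2 > 0 cannot be reproduced by any single scalar reward per response.
PREF = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]
REFERENCE = [1.0 / 3.0] * 3   # uniform reference policy (assumption)
INITIAL = [0.8, 0.1, 0.1]     # deliberately skewed starting policy

final_policy = run_nash_md(PREF, REFERENCE, INITIAL)
print([round(p, 3) for p in final_policy])
```

With a uniform reference and this symmetric cycle, the iterates should approach the uniform policy from the skewed start, since no response can be favored without being exploited by the one that beats it.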