Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences

Abstract

This paper studies post-training large language models (LLMs) using preference feedback from a powerful oracle to help a model iteratively improve on itself. The typical approach to post-training LLMs is Reinforcement Learning from Human Feedback (RLHF), which traditionally separates reward learning from the subsequent policy optimization. However, such a reward-maximization approach is limited by the nature of "point-wise" rewards (such as those of the Bradley-Terry model), which fail to express complex intransitive or cyclic preference relations. While advances in RLHF show that reward learning and policy optimization can be merged into a single contrastive objective for stability, they still remain tethered to the reward-maximization framework. Recently, a new wave of research has sidestepped the reward-maximization presumption in favor of directly optimizing over "pair-wise" or general preferences. In this paper, we introduce Direct Nash Optimization (DNO), a provable and scalable algorithm that marries the simplicity and stability of contrastive learning with the theoretical generality of optimizing general preferences. Because DNO is a batched on-policy algorithm using a regression-based objective, its implementation is straightforward and efficient. Moreover, DNO enjoys monotonic improvement across iterations, which helps it improve even over a strong teacher (such as GPT-4). In our experiments, a 7B-parameter Orca-2.5 model aligned by DNO achieves a state-of-the-art win rate of 33% against GPT-4-Turbo on AlpacaEval 2.0 (even after controlling for response length), an absolute gain of 26 percentage points (from 7% to 33%) over the initializing model. It outperforms models with far more parameters, including Mistral Large, Self-Rewarding LM (70B parameters), and older versions of GPT-4.
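To make the limitation of point-wise rewards concrete, the following is a minimal sketch and not part of DNO itself. It assumes a hypothetical preference matrix `P` over three responses in which each response beats the next with probability 0.9, so the preferences form a cycle. Under a Bradley-Terry model, logit P(i beats j) = r_i - r_j, so the log-odds summed around any cycle must be zero; the sketch checks this residual for the cyclic matrix and for a matrix generated from hypothetical scalar rewards. It also checks that the uniform policy wins exactly half the time against every response, which is the property a Nash solution of this symmetric preference game satisfies. The function names and all numbers are illustrative assumptions.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bt_cycle_residual(P, i, j, k):
    """Sum of log-odds around the cycle i -> j -> k -> i.

    If P came from Bradley-Terry rewards r, each term is a reward
    difference and the sum telescopes to exactly zero.
    """
    return logit(P[i][j]) + logit(P[j][k]) + logit(P[k][i])


def win_rate_against(P, policy):
    """For each response i, the probability that i beats a draw from policy."""
    n = len(policy)
    return [sum(P[i][j] * policy[j] for j in range(n)) for i in range(len(P))]


# Hypothetical cyclic preference: a beats b, b beats c, c beats a.
P_cyclic = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]

# Hypothetical Bradley-Terry preference built from scalar rewards.
rewards = [2.0, 1.0, 0.0]
P_bt = [[sigmoid(ri - rj) for rj in rewards] for ri in rewards]

cyclic_residual = bt_cycle_residual(P_cyclic, 0, 1, 2)
bt_residual = bt_cycle_residual(P_bt, 0, 1, 2)

# Against the uniform policy, no single response wins more than half the time.
uniform = [1.0 / 3.0] * 3
rates = win_rate_against(P_cyclic, uniform)

print(f"cycle residual (cyclic P): {cyclic_residual:.3f}")
print(f"cycle residual (BT P):     {bt_residual:.3f}")
print("win rates vs uniform:", [round(r, 3) for r in rates])
```

The nonzero residual for `P_cyclic` means no assignment of scalar rewards reproduces it, whereas a policy-level solution (here, the uniform mixture) is still well defined. This is the gap that optimizing general preferences is meant to address.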
