Bootstrapping Language Models with DPO Implicit Rewards

Abstract

Human alignment in large language models (LLMs) is an active area of research. Direct preference optimization (DPO), a recent groundbreaking method, greatly simplifies reinforcement learning from human feedback (RLHF) by bypassing its reward learning stage. After training, DPO provides an implicit reward model. In this work, we make the novel observation that this implicit reward model can itself be used in a bootstrapping fashion to further align the LLM. Our approach uses the rewards from the current LLM to construct a preference dataset, which is then used in subsequent DPO rounds. We further improve the approach with refinements that debias response length and improve the quality of the preference dataset. Our approach, named self-alignment with DPO ImpliCit rEwards (DICE), shows large improvements in alignment and outperforms Gemini Pro on AlpacaEval 2, reaching a 27.55% length-controlled win rate against GPT-4 Turbo with only 8B parameters and no external feedback. Our code is available at https://github.com/sail-sg/dice.
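As a rough illustration of the bootstrapping idea in the abstract, the sketch below scores sampled responses with the DPO implicit reward and turns the best and worst into a preference pair for a later DPO round. In the standard DPO formulation the implicit reward is, up to a term that depends only on the prompt, r(x, y) = beta * (log pi_theta(y | x) - log pi_ref(y | x)). Everything else here is an assumption made for illustration: the `Candidate` container, the function names, the default `beta = 0.1`, and the best-versus-worst pairing rule are hypothetical and are not taken from the DICE code. The length debiasing and data-quality refinements mentioned in the abstract are not modeled.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Candidate:
    """One sampled response with its summed token log-probabilities (hypothetical container)."""
    text: str
    policy_logp: float     # log pi_theta(y | x) under the current DPO-trained policy
    reference_logp: float  # log pi_ref(y | x) under the frozen reference model


def implicit_reward(c: Candidate, beta: float = 0.1) -> float:
    """DPO implicit reward, ignoring the prompt-only term: beta * log-ratio."""
    return beta * (c.policy_logp - c.reference_logp)


def build_preference_pair(
    candidates: Sequence[Candidate], beta: float = 0.1
) -> Tuple[Candidate, Candidate]:
    """Return (chosen, rejected): the highest- and lowest-reward candidates.

    The best-vs-worst rule is an assumption of this sketch.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=lambda c: implicit_reward(c, beta))
    return ranked[-1], ranked[0]


# Toy log-probabilities for three responses to the same prompt (made-up numbers).
samples = [
    Candidate("response A", policy_logp=-10.0, reference_logp=-12.0),
    Candidate("response B", policy_logp=-8.0, reference_logp=-7.0),
    Candidate("response C", policy_logp=-20.0, reference_logp=-20.0),
]
chosen, rejected = build_preference_pair(samples)
```

Because all candidates share one prompt, the prompt-only term cancels when rewards are compared, which is why the log-ratio alone is enough for ranking. Repeating this over many prompts would yield the kind of preference dataset that a subsequent DPO round could train on.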
