Bootstrapping Language Models with DPO Implicit Rewards
June 14, 2024
Authors: Changyu Chen, Zichen Liu, Chao Du, Tianyu Pang, Qian Liu, Arunesh Sinha, Pradeep Varakantham, Min Lin
cs.AI
Abstract
Human alignment in large language models (LLMs) is an active area of
research. A recent groundbreaking work, direct preference optimization (DPO),
has greatly simplified the process from past work in reinforcement learning
from human feedback (RLHF) by bypassing the reward learning stage in RLHF. DPO,
after training, provides an implicit reward model. In this work, we make a
novel observation that this implicit reward model can by itself be used in a
bootstrapping fashion to further align the LLM. Our approach is to use the
rewards from a current LLM model to construct a preference dataset, which is
then used in subsequent DPO rounds. We incorporate refinements that debias the
length of the responses and improve the quality of the preference dataset to
further improve our approach. Our approach, named self-alignment with DPO
ImpliCit rEwards (DICE), shows great improvements in alignment and achieves
superior performance to Gemini Pro on AlpacaEval 2, reaching a 27.55%
length-controlled win rate against GPT-4 Turbo, with only 8B parameters and
no external feedback. Our code is available at https://github.com/sail-sg/dice.
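The sketch below illustrates the bootstrapping idea described in the abstract: score sampled responses with the DPO implicit reward, beta * [log pi_theta(y|x) - log pi_ref(y|x)], and keep the highest- and lowest-scoring responses as a chosen/rejected pair for the next DPO round. It is a minimal illustration, not the authors' released implementation: the function names, the beta value, and the Hugging Face-style model interface are assumptions, and the length-debiasing and dataset-quality refinements mentioned in the abstract are omitted.

```python
# Minimal sketch (illustrative, not the DICE codebase): rank sampled responses
# by the DPO implicit reward and build preference pairs for the next DPO round.
import torch


def sequence_logprob(model, tokenizer, prompt, response):
    """Sum of token log-probabilities of `response` given `prompt`.

    Assumes the tokenization of `prompt` is a prefix of the tokenization of
    `prompt + response` (an approximation; real code should tokenize carefully).
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # [1, T, vocab]
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only positions that predict response tokens.
    return token_lp[:, prompt_ids.shape[1] - 1 :].sum().item()


def implicit_reward(policy, ref, tokenizer, prompt, response, beta=0.1):
    """DPO implicit reward, up to a prompt-only constant that cancels when
    comparing responses to the same prompt."""
    return beta * (
        sequence_logprob(policy, tokenizer, prompt, response)
        - sequence_logprob(ref, tokenizer, prompt, response)
    )


def build_preference_pair(policy, ref, tokenizer, prompt, responses, beta=0.1):
    """Score sampled responses and return (chosen, rejected) for further DPO."""
    scored = sorted(
        responses,
        key=lambda y: implicit_reward(policy, ref, tokenizer, prompt, y, beta),
        reverse=True,
    )
    return scored[0], scored[-1]
```

In this reading, `policy` is the DPO-trained model being bootstrapped and `ref` is its reference model; the resulting (chosen, rejected) pairs form the preference dataset used in the subsequent DPO round.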