Learn Your Reference Model for Real Good Alignment

April 15, 2024
Authors: Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, Daniil Gavrilov
cs.AI

Abstract

The complexity of the alignment problem stems from the fact that existing methods are unstable. Researchers continuously invent various tricks to address this shortcoming. For instance, in the fundamental Reinforcement Learning From Human Feedback (RLHF) technique of Language Model alignment, in addition to reward maximization, the Kullback-Leibler divergence between the trainable policy and the SFT policy is minimized. This addition prevents the model from being overfitted to the Reward Model (RM) and generating texts that are out-of-domain for the RM. The Direct Preference Optimization (DPO) method reformulates the optimization task of RLHF and eliminates the Reward Model while tacitly maintaining the requirement for the policy to be close to the SFT policy. In our paper, we argue that this implicit limitation in the DPO method leads to sub-optimal results. We propose a new method called Trust Region DPO (TR-DPO), which updates the reference policy during training. With such a straightforward update, we demonstrate the effectiveness of TR-DPO against DPO on the Anthropic HH and TLDR datasets. We show that TR-DPO outperforms DPO by up to 19%, measured by automatic evaluation with GPT-4. The new alignment approach that we propose allows us to improve the quality of models across several parameters at once, such as coherence, correctness, level of detail, helpfulness, and harmlessness.
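The abstract states only that TR-DPO updates the reference policy during training, without specifying the mechanism. The sketch below is a minimal, assumption-laden illustration of how such an update could sit alongside a standard DPO loss: the soft-update weight alpha, the hard-update interval tau_steps, and the helper names are hypothetical choices for this example, not necessarily the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Standard DPO objective: the reference log-probs implicitly keep the
    # trainable policy close to the reference policy, with strength beta.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()

@torch.no_grad()
def soft_update_reference(policy_model, ref_model, alpha=0.5):
    # One way to "update the reference policy during training":
    # blend the current policy weights into the reference (Polyak-style).
    for p_ref, p_pol in zip(ref_model.parameters(), policy_model.parameters()):
        p_ref.mul_(1.0 - alpha).add_(p_pol, alpha=alpha)

@torch.no_grad()
def hard_update_reference(policy_model, ref_model):
    # Alternative: periodically copy the policy weights into the reference.
    ref_model.load_state_dict(policy_model.state_dict())

# Hypothetical training-loop usage (names and schedule are illustrative):
# for step, batch in enumerate(dataloader):
#     loss = dpo_loss(*compute_logps(policy_model, ref_model, batch))
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
#     if step % tau_steps == 0:
#         soft_update_reference(policy_model, ref_model, alpha=0.5)
```

Either update rule relaxes the fixed-SFT-anchor constraint that the abstract argues leads to sub-optimal DPO results; how aggressively the reference tracks the policy is governed here by the assumed alpha and tau_steps hyperparameters.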
