Learn Your Reference Model for Real Good Alignment
April 15, 2024
Authors: Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, Daniil Gavrilov
cs.AI
Abstract
The complexity of the alignment problem stems from the fact that existing
methods are unstable. Researchers continuously invent various tricks to address
this shortcoming. For instance, in the fundamental Reinforcement Learning From
Human Feedback (RLHF) technique of Language Model alignment, in addition to
reward maximization, the Kullback-Leibler divergence between the trainable
policy and the SFT policy is minimized. This addition prevents the model from
overfitting to the Reward Model (RM) and from generating texts that are
out-of-domain for the RM. The Direct Preference Optimization (DPO) method
reformulates the optimization task of RLHF and eliminates the Reward Model
while tacitly maintaining the requirement for the policy to be close to the SFT
policy. In our paper, we argue that this implicit limitation in the DPO method
leads to sub-optimal results. We propose a new method called Trust Region DPO
(TR-DPO), which updates the reference policy during training. With such a
straightforward update, we demonstrate the effectiveness of TR-DPO against DPO
on the Anthropic HH and TLDR datasets. We show that TR-DPO outperforms DPO by
up to 19%, measured by automatic evaluation with GPT-4. The new alignment
approach that we propose allows us to improve the quality of models across
several parameters at once, such as coherence, correctness, level of detail,
helpfulness, and harmlessness.
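
To make the idea of updating the reference policy concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not specify the update rule, so the two rules shown (a soft interpolation with a hypothetical weight `alpha`, and a hard copy every hypothetical `tau` steps) are assumed examples. The function names and the flat-list representation of parameters are also hypothetical. The loss is the standard DPO loss for a single preference pair, written in terms of sequence log-probabilities.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probs.

    The reference log-probs are what implicitly keep the policy close to
    the reference model; in plain DPO that reference is the frozen SFT policy.
    """
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def soft_update(ref_params, policy_params, alpha):
    """Assumed rule: move the reference a fraction alpha toward the policy."""
    return [alpha * p + (1.0 - alpha) * r
            for p, r in zip(policy_params, ref_params)]


def hard_update(ref_params, policy_params, step, tau):
    """Assumed rule: replace the reference with the policy every tau steps."""
    if step > 0 and step % tau == 0:
        return list(policy_params)
    return ref_params
```

In a training loop, one of the update functions would be called after each optimizer step, and the reference log-probabilities passed to `dpo_loss` would be recomputed from the updated reference parameters. With `alpha = 0` (or a `tau` larger than the number of training steps) the reference never changes and the sketch reduces to plain DPO.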