Modulated Intervention Preference Optimization (MIPO): Keep the Easy, Refine the Difficult
September 26, 2024
Author: Cheolhun Jang
cs.AI
Abstract
Preference optimization methods typically begin training with a well-trained
SFT model as a reference model. In RLHF and DPO, a regularization term is used
during the preference optimization process to prevent the policy model from
deviating too far from the reference model's distribution, thereby avoiding the
generation of anomalous responses. When the reference model is already
well-aligned with the given data or only requires slight adjustments, this
approach can produce a well-aligned model. However, if the reference model is
not aligned with the given data and requires significant deviation from its
current state, the regularization term may actually hinder model alignment.
In this study, we propose Modulated Intervention Preference
Optimization (MIPO) to address this issue. MIPO modulates the degree of
intervention from the reference model based on how well the given data is
aligned with it. If the data is well-aligned, the intervention is increased to
prevent the policy model from diverging significantly from the reference model.
Conversely, if the alignment is poor, the intervention is reduced to facilitate
more extensive training. We compare the performance of MIPO and DPO using
Mistral-7B and Llama3-8B on Alpaca Eval 2.0 and MT-Bench. The experimental
results demonstrate that MIPO consistently outperforms DPO across various
evaluation scenarios.
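
The contrast described in the abstract can be made concrete with a small sketch. The first function is the standard DPO loss, in which the reference model's log-probability margin enters with a fixed weight of 1. The second is only an illustration of the modulation idea: the abstract does not give the MIPO objective, so the softplus weighting, the function names, and the use of plain sequence log-probabilities as inputs are all assumptions made here, not the paper's formula.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)), equal to -softplus(-x)."""
    return -softplus(-x)


def dpo_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    pol_w / pol_l: policy log-probs of the chosen / rejected response.
    ref_w / ref_l: reference log-probs of the chosen / rejected response.
    """
    margin = (pol_w - pol_l) - (ref_w - ref_l)
    return -log_sigmoid(beta * margin)


def intervention_weight(ref_w: float, ref_l: float) -> float:
    """HYPOTHETICAL modulation weight (an assumption, not from the abstract).

    Grows when the reference already prefers the chosen response and
    shrinks toward 0 when it prefers the rejected one.
    """
    return softplus(ref_w - ref_l)


def modulated_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
                   beta: float = 0.1) -> float:
    """Illustrative modulated loss: the reference term is scaled by the
    hypothetical weight above instead of a fixed weight of 1."""
    ref_margin = ref_w - ref_l
    weight = intervention_weight(ref_w, ref_l)
    margin = (pol_w - pol_l) - weight * ref_margin
    return -log_sigmoid(beta * margin)
```

Under this assumed weighting, a pair the reference already ranks correctly keeps a strong reference term, while a pair it ranks incorrectly has the reference term almost switched off, so the policy is pushed further from the reference on that pair than plain DPO would push it.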