Xwin-LM: Strong and Scalable Alignment Practice for LLMs
May 30, 2024
Authors: Bolin Ni, JingCheng Hu, Yixuan Wei, Houwen Peng, Zheng Zhang, Gaofeng Meng, Han Hu
cs.AI
Abstract
In this work, we present Xwin-LM, a comprehensive suite of alignment
methodologies for large language models (LLMs). This suite encompasses several
key techniques, including supervised finetuning (SFT), reward modeling (RM),
rejection sampling finetuning (RS), and direct preference optimization (DPO).
The key components are as follows: (1) Xwin-LM-SFT, models initially finetuned
with high-quality instruction data; (2) Xwin-Pair, a large-scale, multi-turn
preference dataset meticulously annotated using GPT-4; (3) Xwin-RM, reward
models trained on Xwin-Pair, developed at scales of 7B, 13B, and 70B
parameters; (4) Xwin-Set, a multiwise preference dataset in which each prompt
is linked to 64 unique responses generated by Xwin-LM-SFT and scored by
Xwin-RM; (5) Xwin-LM-RS, models finetuned with the highest-scoring responses
from Xwin-Set; (6) Xwin-LM-DPO, models further optimized on Xwin-Set using the
DPO algorithm. Our evaluations on AlpacaEval and MT-bench show consistent and
significant improvements across the pipeline, demonstrating the strength and
scalability of Xwin-LM. The repository
https://github.com/Xwin-LM/Xwin-LM will be continually updated to foster
community research.
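
To make the pipeline concrete, below is a minimal sketch of how rejection-sampling (RS) finetuning data and DPO preference pairs could be derived from a multiwise preference set such as Xwin-Set, where each prompt is paired with reward-model-scored responses. This is an illustrative assumption, not the released Xwin-LM code: all names (Example, build_rs_and_dpo_data) are hypothetical, and the simple best-versus-worst pairing for DPO is just one possible choice, since the abstract does not specify how pairs are formed.

```python
# Hypothetical sketch: deriving RS finetuning data and DPO preference pairs
# from a multiwise preference set (one prompt, N scored responses), in the
# spirit of Xwin-Set. Not the authors' implementation.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    prompt: str
    responses: List[str]   # e.g. 64 responses sampled from the SFT model
    scores: List[float]    # reward-model scores, one per response


def build_rs_and_dpo_data(
    dataset: List[Example],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str, str]]]:
    """Return (rs_pairs, dpo_triples).

    rs_pairs:    (prompt, best_response) for rejection-sampling finetuning.
    dpo_triples: (prompt, chosen, rejected) preference pairs for DPO.
    """
    rs_pairs, dpo_triples = [], []
    for ex in dataset:
        # Rank responses by reward-model score, highest first.
        ranked = sorted(
            zip(ex.scores, ex.responses), key=lambda x: x[0], reverse=True
        )
        best, worst = ranked[0][1], ranked[-1][1]
        # RS finetuning keeps only the top-scored response per prompt.
        rs_pairs.append((ex.prompt, best))
        # One simple chosen/rejected pairing for DPO (assumption).
        dpo_triples.append((ex.prompt, best, worst))
    return rs_pairs, dpo_triples
```

The selection of the highest-scoring response for RS finetuning follows the abstract directly; the RS data could then be used for standard supervised finetuning, while the triples would feed a DPO objective.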