Xwin-LM: Strong and Scalable Alignment Practice for LLMs
May 30, 2024
Authors: Bolin Ni, JingCheng Hu, Yixuan Wei, Houwen Peng, Zheng Zhang, Gaofeng Meng, Han Hu
cs.AI
Abstract
In this work, we present Xwin-LM, a comprehensive suite of alignment
methodologies for large language models (LLMs). This suite encompasses several
key techniques, including supervised finetuning (SFT), reward modeling (RM),
rejection sampling finetuning (RS), and direct preference optimization (DPO).
The key components are as follows: (1) Xwin-LM-SFT, models initially finetuned
with high-quality instruction data; (2) Xwin-Pair, a large-scale, multi-turn
preference dataset meticulously annotated using GPT-4; (3) Xwin-RM, reward
models trained on Xwin-Pair, developed at scales of 7B, 13B, and 70B
parameters; (4) Xwin-Set, a multiwise preference dataset in which each prompt
is linked to 64 unique responses generated by Xwin-LM-SFT and scored by
Xwin-RM; (5) Xwin-LM-RS, models finetuned with the highest-scoring responses
from Xwin-Set; (6) Xwin-LM-DPO, models further optimized on Xwin-Set using the
DPO algorithm. Our evaluations on AlpacaEval and MT-bench show consistent and
significant improvements across the pipeline, demonstrating the strength and
scalability of Xwin-LM. The repository
https://github.com/Xwin-LM/Xwin-LM will be continually updated to foster
community research.
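
To make the pipeline concrete, below is a minimal sketch of how rejection-sampling (RS) finetuning data and DPO preference pairs could be derived from a multiwise preference set such as Xwin-Set, where each prompt is paired with reward-model-scored responses. This is an illustrative assumption, not the released Xwin-LM code: all names (Example, build_rs_and_dpo_data) are hypothetical, and the simple best-versus-worst pairing for DPO is just one possible choice, since the abstract does not specify how pairs are formed.

```python
# Hypothetical sketch: deriving RS finetuning data and DPO preference pairs
# from a multiwise preference set (one prompt, N scored responses), in the
# spirit of Xwin-Set. Not the authors' implementation.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    prompt: str
    responses: List[str]   # e.g. 64 responses sampled from the SFT model
    scores: List[float]    # reward-model scores, one per response


def build_rs_and_dpo_data(
    dataset: List[Example],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str, str]]]:
    """Return (rs_pairs, dpo_triples).

    rs_pairs:    (prompt, best_response) for rejection-sampling finetuning.
    dpo_triples: (prompt, chosen, rejected) preference pairs for DPO.
    """
    rs_pairs, dpo_triples = [], []
    for ex in dataset:
        # Rank responses by reward-model score, highest first.
        ranked = sorted(
            zip(ex.scores, ex.responses), key=lambda x: x[0], reverse=True
        )
        best, worst = ranked[0][1], ranked[-1][1]
        # RS finetuning keeps only the top-scored response per prompt.
        rs_pairs.append((ex.prompt, best))
        # One simple chosen/rejected pairing for DPO (assumption).
        dpo_triples.append((ex.prompt, best, worst))
    return rs_pairs, dpo_triples
```

The selection of the highest-scoring response for RS finetuning follows the abstract directly; the RS data could then be used for standard supervised finetuning, while the triples would feed a DPO objective.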