Xwin-LM: Strong and Scalable Alignment Practice for LLMs
May 30, 2024
Authors: Bolin Ni, JingCheng Hu, Yixuan Wei, Houwen Peng, Zheng Zhang, Gaofeng Meng, Han Hu
cs.AI
Abstract
In this work, we present Xwin-LM, a comprehensive suite of alignment
methodologies for large language models (LLMs). This suite encompasses several
key techniques, including supervised finetuning (SFT), reward modeling (RM),
rejection sampling finetuning (RS), and direct preference optimization (DPO).
The key components are as follows: (1) Xwin-LM-SFT, models initially finetuned
with high-quality instruction data; (2) Xwin-Pair, a large-scale, multi-turn
preference dataset meticulously annotated using GPT-4; (3) Xwin-RM, reward
models trained on Xwin-Pair, developed at scales of 7B, 13B, and 70B
parameters; (4) Xwin-Set, a multiwise preference dataset in which each prompt
is linked to 64 unique responses generated by Xwin-LM-SFT and scored by
Xwin-RM; (5) Xwin-LM-RS, models finetuned with the highest-scoring responses
from Xwin-Set; (6) Xwin-LM-DPO, models further optimized on Xwin-Set using the
DPO algorithm. Our evaluations on AlpacaEval and MT-bench show consistent and
significant improvements across the pipeline, demonstrating the
strength and scalability of Xwin-LM. The repository
https://github.com/Xwin-LM/Xwin-LM will be continually updated to foster
community research.
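The abstract names the pipeline stages without implementation detail. As a rough illustration only, the sketch below shows two of the later stages in generic form: best-of-n selection with a reward model (the idea behind building Xwin-LM-RS training data from the 64 scored responses in Xwin-Set) and the standard DPO objective (used for Xwin-LM-DPO). All names and values here (e.g. `score_fn`, `build_rs_finetune_data`, `beta=0.1`) are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch only -- not the Xwin-LM implementation.
# Names such as score_fn, build_rs_finetune_data, and beta=0.1 are assumptions.
from typing import Callable, Dict, List

import torch
import torch.nn.functional as F


def build_rs_finetune_data(
    xwin_set: List[Dict],                   # [{"prompt": str, "responses": [str, ...]}, ...]
    score_fn: Callable[[str, str], float],  # reward model: (prompt, response) -> scalar score
) -> List[Dict]:
    """Rejection sampling: for each prompt, keep only the highest-scoring of its
    sampled responses (best-of-64 in Xwin-Set) as the finetuning target."""
    finetune_data = []
    for record in xwin_set:
        prompt, responses = record["prompt"], record["responses"]
        best = max(responses, key=lambda r: score_fn(prompt, r))
        finetune_data.append({"prompt": prompt, "response": best})
    return finetune_data


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(y_w | x), summed over tokens
    policy_rejected_logps: torch.Tensor,  # log p_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log p_ref(y_w | x) from the frozen reference model
    ref_rejected_logps: torch.Tensor,     # log p_ref(y_l | x)
    beta: float = 0.1,                    # KL-penalty strength (value assumed)
) -> torch.Tensor:
    """Standard DPO objective: -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```

In the multiwise Xwin-Set, chosen/rejected pairs for DPO could in principle be formed from higher- versus lower-scored responses to the same prompt; how the paper actually constructs these pairs is not specified in the abstract.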