Xwin-LM: Strong and Scalable Alignment Practice for LLMs
May 30, 2024
Authors: Bolin Ni, JingCheng Hu, Yixuan Wei, Houwen Peng, Zheng Zhang, Gaofeng Meng, Han Hu
cs.AI
Abstract
In this work, we present Xwin-LM, a comprehensive suite of alignment
methodologies for large language models (LLMs). This suite encompasses several
key techniques, including supervised finetuning (SFT), reward modeling (RM),
rejection sampling finetuning (RS), and direct preference optimization (DPO).
The key components are as follows: (1) Xwin-LM-SFT, models initially finetuned
with high-quality instruction data; (2) Xwin-Pair, a large-scale, multi-turn
preference dataset meticulously annotated using GPT-4; (3) Xwin-RM, reward
models trained on Xwin-Pair, developed at scales of 7B, 13B, and 70B
parameters; (4) Xwin-Set, a multiwise preference dataset in which each prompt
is linked to 64 unique responses generated by Xwin-LM-SFT and scored by
Xwin-RM; (5) Xwin-LM-RS, models finetuned with the highest-scoring responses
from Xwin-Set; (6) Xwin-LM-DPO, models further optimized on Xwin-Set using the
DPO algorithm. Our evaluations on AlpacaEval and MT-bench show consistent and
significant improvements across the pipeline, demonstrating the
strength and scalability of Xwin-LM. The repository
https://github.com/Xwin-LM/Xwin-LM will be continually updated to foster
community research.
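The abstract names the pipeline stages without implementation detail. As a rough illustration only, the sketch below shows two of the later stages in generic form: best-of-n selection with a reward model (the idea behind building Xwin-LM-RS training data from the 64 scored responses in Xwin-Set) and the standard DPO objective (used for Xwin-LM-DPO). All names and values here (e.g. `score_fn`, `build_rs_finetune_data`, `beta=0.1`) are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch only -- not the Xwin-LM implementation.
# Names such as score_fn, build_rs_finetune_data, and beta=0.1 are assumptions.
from typing import Callable, Dict, List

import torch
import torch.nn.functional as F


def build_rs_finetune_data(
    xwin_set: List[Dict],                   # [{"prompt": str, "responses": [str, ...]}, ...]
    score_fn: Callable[[str, str], float],  # reward model: (prompt, response) -> scalar score
) -> List[Dict]:
    """Rejection sampling: for each prompt, keep only the highest-scoring of its
    sampled responses (best-of-64 in Xwin-Set) as the finetuning target."""
    finetune_data = []
    for record in xwin_set:
        prompt, responses = record["prompt"], record["responses"]
        best = max(responses, key=lambda r: score_fn(prompt, r))
        finetune_data.append({"prompt": prompt, "response": best})
    return finetune_data


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(y_w | x), summed over tokens
    policy_rejected_logps: torch.Tensor,  # log p_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log p_ref(y_w | x) from the frozen reference model
    ref_rejected_logps: torch.Tensor,     # log p_ref(y_l | x)
    beta: float = 0.1,                    # KL-penalty strength (value assumed)
) -> torch.Tensor:
    """Standard DPO objective: -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```

In the multiwise Xwin-Set, chosen/rejected pairs for DPO could in principle be formed from higher- versus lower-scored responses to the same prompt; how the paper actually constructs these pairs is not specified in the abstract.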