Xwin-LM: Strong and Scalable Alignment Practice for LLMs

Abstract

In this work, we present Xwin-LM, a comprehensive suite of alignment methodologies for large language models (LLMs). This suite encompasses several key techniques, including supervised finetuning (SFT), reward modeling (RM), rejection sampling finetuning (RS), and direct preference optimization (DPO). The key components are as follows: (1) Xwin-LM-SFT, models initially finetuned with high-quality instruction data; (2) Xwin-Pair, a large-scale, multi-turn preference dataset meticulously annotated using GPT-4; (3) Xwin-RM, reward models trained on Xwin-Pair, developed at scales of 7B, 13B, and 70B parameters; (4) Xwin-Set, a multiwise preference dataset in which each prompt is linked to 64 unique responses generated by Xwin-LM-SFT and scored by Xwin-RM; (5) Xwin-LM-RS, models finetuned with the highest-scoring responses from Xwin-Set; (6) Xwin-LM-DPO, models further optimized on Xwin-Set using the DPO algorithm. Our evaluations on AlpacaEval and MT-bench show consistent and significant improvements across the pipeline, demonstrating the strength and scalability of Xwin-LM. The repository https://github.com/Xwin-LM/Xwin-LM will be continually updated to foster community research.
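
The data flow in components (4) through (6) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `ScoredResponse` type and the helper names are hypothetical, the best-versus-worst pairing rule is an assumption (the abstract does not say how preference pairs are drawn from the 64 scored responses), and the loss is the standard per-pair DPO objective with an assumed `beta` of 0.1.

```python
import math
from dataclasses import dataclass


@dataclass
class ScoredResponse:
    """One sampled response and its reward-model score (hypothetical type)."""
    text: str
    reward: float


def select_best(responses):
    """Rejection-sampling step: keep the highest-scoring response."""
    return max(responses, key=lambda r: r.reward)


def make_preference_pair(responses):
    """Return (chosen, rejected) as the best and worst scored responses.

    Assumption: best-vs-worst is only one simple pairing rule; the
    abstract does not specify which rule is used.
    """
    ranked = sorted(responses, key=lambda r: r.reward, reverse=True)
    return ranked[0], ranked[-1]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model. beta=0.1 is an assumed default.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Toy example with three responses instead of 64.
samples = [
    ScoredResponse("a", 0.2),
    ScoredResponse("b", 1.5),
    ScoredResponse("c", -0.7),
]
best = select_best(samples)                       # target for RS finetuning
chosen, rejected = make_preference_pair(samples)  # one DPO training pair
```

In this sketch, rejection sampling finetuning would train on `best` alone, while DPO would use the `(chosen, rejected)` pair and minimize `dpo_loss`, which falls below log 2 once the policy prefers the chosen response more strongly than the reference model does.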
