LongAlign: A Recipe for Long Context Alignment of Large Language Models

Abstract

Extending large language models to effectively handle long contexts requires instruction fine-tuning on input sequences of similar length. To address this, we present LongAlign, a recipe covering instruction data, training, and evaluation for long context alignment. First, we construct a long instruction-following dataset using Self-Instruct. To ensure data diversity, it covers a broad range of tasks from various long context sources. Second, we adopt packing and sorted batching strategies to speed up supervised fine-tuning on data with varied length distributions. Additionally, we develop a loss weighting method to balance the contribution to the loss across different sequences during packing training. Third, we introduce the LongBench-Chat benchmark for evaluating instruction-following capabilities on queries of 10k-100k in length. Experiments show that LongAlign outperforms existing recipes for LLMs in long context tasks by up to 30%, while also maintaining their proficiency in handling short, generic tasks. The code, data, and long-aligned models are open-sourced at https://github.com/THUDM/LongAlign.
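The packing and loss-balancing ideas mentioned in the abstract can be illustrated with a small sketch. It is not the LongAlign implementation: the function names (`pack_first_fit`, `token_mean_loss`, `sequence_balanced_loss`), the first-fit packing heuristic, and the "mean of per-sequence means" weighting are all assumptions chosen for illustration. The sketch only shows why a plain token-level average over a pack lets sequences with many target tokens dominate, and how a per-sequence weighting removes that imbalance.

```python
from typing import List, Sequence


def pack_first_fit(lengths: Sequence[int], max_len: int) -> List[List[int]]:
    """Greedily pack sequence indices into packs of at most max_len tokens.

    Hypothetical helper: first-fit over sequences sorted by decreasing length.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    packs: List[List[int]] = []
    free: List[int] = []  # remaining token capacity of each pack
    for i in order:
        if lengths[i] > max_len:
            raise ValueError(f"sequence {i} is longer than max_len")
        for p, room in enumerate(free):
            if lengths[i] <= room:
                packs[p].append(i)
                free[p] -= lengths[i]
                break
        else:
            # No existing pack has room, so open a new one.
            packs.append([i])
            free.append(max_len - lengths[i])
    return packs


def token_mean_loss(token_losses: List[List[float]]) -> float:
    """Naive packed loss: average over every target token in the pack."""
    flat = [x for seq in token_losses for x in seq]
    return sum(flat) / len(flat)


def sequence_balanced_loss(token_losses: List[List[float]]) -> float:
    """Assumed weighting: average the per-sequence mean losses,
    so each sequence contributes equally regardless of its length."""
    per_seq = [sum(seq) / len(seq) for seq in token_losses]
    return sum(per_seq) / len(per_seq)


if __name__ == "__main__":
    print(pack_first_fit([6, 3, 5, 2], max_len=8))
    # One sequence with four target tokens, one with a single target token.
    losses = [[1.0, 1.0, 1.0, 1.0], [4.0]]
    print(token_mean_loss(losses), sequence_balanced_loss(losses))
```

In the toy example, the sequence with four target tokens pulls the token-level average toward its own loss, while the balanced version treats both sequences equally. Sorted batching, the other strategy named in the abstract, would instead group sequences of similar length into the same batch to reduce idle time from padding.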
