
LongAlign: A Recipe for Long Context Alignment of Large Language Models

January 31, 2024
Authors: Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, Juanzi Li
cs.AI

Abstract

Extending large language models to effectively handle long contexts requires instruction fine-tuning on input sequences of similar length. To address this, we present LongAlign -- a recipe covering the instruction data, training, and evaluation for long context alignment. First, we construct a long instruction-following dataset using Self-Instruct. To ensure data diversity, it covers a broad range of tasks from various long context sources. Second, we adopt packing and sorted batching strategies to speed up supervised fine-tuning on data with varied length distributions. Additionally, we develop a loss weighting method to balance the contribution to the loss across different sequences during packed training. Third, we introduce the LongBench-Chat benchmark for evaluating instruction-following capabilities on queries of 10k-100k in length. Experiments show that LongAlign outperforms existing recipes for LLMs on long context tasks by up to 30%, while also maintaining their proficiency in handling short, generic tasks. The code, data, and long-aligned models are open-sourced at https://github.com/THUDM/LongAlign.
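
The loss weighting idea can be made concrete with a small sketch. The snippet below is a minimal PyTorch illustration based only on the abstract's description, not the authors' released implementation: token losses in a packed example are first averaged within each sequence, then averaged across sequences, so that a short sequence is not drowned out by a long one. The names `packed_sequence_loss`, `token_loss`, and `seq_ids` are hypothetical.

```python
# Minimal sketch (assumption, not the LongAlign implementation) of
# per-sequence loss weighting inside a packed training example.
import torch

def packed_sequence_loss(token_loss: torch.Tensor, seq_ids: torch.Tensor) -> torch.Tensor:
    """Balance per-sequence contributions inside one packed example.

    token_loss: (num_tokens,) per-token losses for the pack, with
                non-target / padding tokens already removed or masked.
    seq_ids:    (num_tokens,) integer id of the sequence each token
                belongs to, e.g. [0, 0, 0, 1, 1, 2, ...].
    """
    num_seqs = int(seq_ids.max().item()) + 1
    # Sum losses and count tokens per sequence.
    loss_sum = torch.zeros(num_seqs, device=token_loss.device).index_add_(0, seq_ids, token_loss)
    tok_count = torch.zeros(num_seqs, device=token_loss.device).index_add_(
        0, seq_ids, torch.ones_like(token_loss))
    per_seq_mean = loss_sum / tok_count.clamp(min=1)  # mean loss of each sequence
    return per_seq_mean.mean()                        # every sequence weighted equally


if __name__ == "__main__":
    # Toy pack: sequence 0 has 4 target tokens, sequence 1 has 1.
    token_loss = torch.tensor([1.0, 1.0, 1.0, 1.0, 5.0])
    seq_ids = torch.tensor([0, 0, 0, 0, 1])
    print(packed_sequence_loss(token_loss, seq_ids))  # (1.0 + 5.0) / 2 = 3.0
```

Averaging naively over all tokens in the pack would instead let sequences with many target tokens dominate the gradient, which is the imbalance the abstract's loss weighting method is meant to correct.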