
LongAlign: A Recipe for Long Context Alignment of Large Language Models

January 31, 2024
Authors: Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, Juanzi Li
cs.AI

Abstract

Extending large language models to effectively handle long contexts requires instruction fine-tuning on input sequences of similar length. To address this, we present LongAlign -- a recipe of the instruction data, training, and evaluation for long context alignment. First, we construct a long instruction-following dataset using Self-Instruct. To ensure the data diversity, it covers a broad range of tasks from various long context sources. Second, we adopt the packing and sorted batching strategies to speed up supervised fine-tuning on data with varied length distributions. Additionally, we develop a loss weighting method to balance the contribution to the loss across different sequences during packing training. Third, we introduce the LongBench-Chat benchmark for evaluating instruction-following capabilities on queries of 10k-100k in length. Experiments show that LongAlign outperforms existing recipes for LLMs in long context tasks by up to 30%, while also maintaining their proficiency in handling short, generic tasks. The code, data, and long-aligned models are open-sourced at https://github.com/THUDM/LongAlign.
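
To illustrate the loss weighting idea mentioned in the abstract, below is a minimal PyTorch-style sketch, not the authors' released implementation (see the GitHub repository for that). The assumption behind it: when several sequences are packed into one training example, a plain token-averaged loss lets long sequences dominate, so each sequence's loss is first averaged over its own target tokens and only then averaged across sequences. Names such as `weighted_packed_loss` and `seq_boundaries` are hypothetical.

```python
import torch
import torch.nn.functional as F


def weighted_packed_loss(logits, labels, seq_boundaries):
    """Loss-weighted objective for one packed training example (sketch).

    logits:         (total_tokens, vocab_size) for the packed example
    labels:         (total_tokens,) target ids, -100 marks non-target tokens,
                    assumed already shifted to align with logits
    seq_boundaries: list of (start, end) token index pairs, one per sequence
    """
    # Per-token cross entropy; positions labeled -100 contribute zero loss.
    per_token = F.cross_entropy(
        logits, labels, ignore_index=-100, reduction="none"
    )
    seq_losses = []
    for start, end in seq_boundaries:
        mask = labels[start:end] != -100
        n_target = mask.sum()
        if n_target > 0:
            # Average this sequence's loss over its own target tokens ...
            seq_losses.append(per_token[start:end][mask].sum() / n_target)
    # ... then give every sequence equal weight, regardless of its length.
    return torch.stack(seq_losses).mean()
```

Under this kind of weighting, the throughput benefit of packing is kept while each packed sequence influences the gradient roughly as it would if it were trained in its own batch, which is the balance the abstract describes.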