Training-Free Long-Context Scaling of Large Language Models

February 27, 2024
Authors: Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, Lingpeng Kong
cs.AI

Abstract

The ability of Large Language Models (LLMs) to process and generate coherent text is markedly weakened when the number of input tokens exceeds their pretraining length. Given the expensive overhead of finetuning large-scale models with longer sequences, we propose Dual Chunk Attention (DCA), which enables Llama2 70B to support context windows of more than 100k tokens without continual training. By decomposing the attention computation for long sequences into chunk-based modules, DCA manages to effectively capture the relative positional information of tokens within the same chunk (Intra-Chunk) and across distinct chunks (Inter-Chunk), and integrates seamlessly with Flash Attention. In addition to its impressive extrapolation capability, DCA achieves performance on practical long-context tasks that is comparable to or even better than that of finetuned models. When compared with proprietary models, our training-free 70B model attains 94% of the performance of gpt-3.5-16k, indicating it is a viable open-source alternative. All code and data used in this work are released at https://github.com/HKUNLP/ChunkLlama.
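To make the chunk-based decomposition concrete, the minimal Python sketch below shows one way relative position offsets can be kept inside the pretrained window by indexing tokens per chunk. The chunk size, function name, and the specific inter-chunk index rule here are assumptions chosen for exposition; they do not reproduce the released ChunkLlama implementation.

```python
def relative_distance(q_idx: int, k_idx: int, chunk_size: int) -> int:
    """Toy chunk-based relative offset (illustrative only, not the paper's code).

    Tokens are grouped into fixed-size chunks. Offsets between tokens in the
    same chunk are exact (intra-chunk); for keys in earlier chunks the query
    is treated as sitting at the last in-window position, so the offset stays
    inside the pretrained position range (inter-chunk).
    """
    if q_idx // chunk_size == k_idx // chunk_size:
        return q_idx - k_idx                        # intra-chunk: exact offset
    return (chunk_size - 1) - (k_idx % chunk_size)  # inter-chunk: bounded offset

# Demo: a 12-token sequence with a hypothetical 4-token pretraining window.
CHUNK = 4
for q in range(12):
    row = [relative_distance(q, k, CHUNK) for k in range(q + 1)]  # causal: k <= q
    assert all(0 <= d < CHUNK for d in row)   # never exceeds trained positions
    print(row)
```

The point of the sketch is only that every offset fed to the position encoding stays within the range seen during pretraining, which is what lets a short-context model attend over a much longer sequence without encountering unseen position values.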
