Training-Free Long-Context Scaling of Large Language Models
February 27, 2024
Authors: Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, Lingpeng Kong
cs.AI
Abstract
The ability of Large Language Models (LLMs) to process and generate coherent
text is markedly weakened when the number of input tokens exceeds their
pretraining length. Given the expensive overhead of finetuning large-scale
models with longer sequences, we propose Dual Chunk Attention (DCA), which
enables Llama2 70B to support context windows of more than 100k tokens without
continual training. By decomposing the attention computation for long sequences
into chunk-based modules, DCA effectively captures the relative positional
information of tokens within the same chunk (Intra-Chunk) and across distinct
chunks (Inter-Chunk), and integrates seamlessly with Flash Attention. In
addition to its impressive extrapolation capability, DCA achieves
performance on practical long-context tasks that is comparable to or even
better than that of finetuned models. When compared with proprietary models,
our training-free 70B model attains 94% of the performance of gpt-3.5-16k,
indicating it is a viable open-source alternative. All code and data used in
this work are released at https://github.com/HKUNLP/ChunkLlama.
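To make the chunk-based idea in the abstract concrete, below is a minimal, illustrative sketch (not the authors' released implementation) of how decomposing a long sequence into chunks can keep every relative position within a bounded range: queries and keys in the same chunk use their ordinary in-chunk offsets (Intra-Chunk), while queries attending to earlier chunks reuse a capped position (Inter-Chunk). The function and parameter names (`build_relative_positions`, `chunk_size`) are hypothetical and chosen only for this sketch; for the actual method, see the ChunkLlama repository linked above.

```python
# Illustrative sketch of chunk-based relative position indices.
# Assumption: names and the exact inter-chunk capping rule are simplified
# for exposition and do not reproduce the paper's precise scheme.
import numpy as np


def build_relative_positions(seq_len: int, chunk_size: int) -> np.ndarray:
    """Return a (seq_len, seq_len) lower-triangular matrix of relative distances."""
    in_chunk = np.arange(seq_len) % chunk_size    # position of each token inside its chunk
    chunk_id = np.arange(seq_len) // chunk_size   # index of the chunk each token belongs to

    rel = np.full((seq_len, seq_len), -1, dtype=np.int64)  # -1 marks masked (non-causal) entries
    for q in range(seq_len):
        for k in range(q + 1):                    # causal attention: keys up to the query
            if chunk_id[q] == chunk_id[k]:
                # Intra-chunk: ordinary relative distance between query and key.
                rel[q, k] = in_chunk[q] - in_chunk[k]
            else:
                # Inter-chunk: treat the query as if it sat at the last in-chunk
                # position, so the distance never exceeds chunk_size - 1.
                rel[q, k] = (chunk_size - 1) - in_chunk[k]
    return rel


if __name__ == "__main__":
    # With chunk_size=4, no relative distance exceeds 3 even though the
    # sequence has 8 tokens, i.e. distances stay inside the "pretrained" range.
    print(build_relative_positions(seq_len=8, chunk_size=4))
```

Because all resulting distances fall inside the range seen during pretraining, the same rotary position embeddings can be reused without continual training, which is the property the abstract highlights.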