Training-Free Long-Context Scaling of Large Language Models
February 27, 2024
Authors: Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, Lingpeng Kong
cs.AI
Abstract
The ability of Large Language Models (LLMs) to process and generate coherent
text is markedly weakened when the number of input tokens exceeds their
pretraining length. Given the expensive overhead of finetuning large-scale
models with longer sequences, we propose Dual Chunk Attention (DCA), which
enables Llama2 70B to support context windows of more than 100k tokens without
continual training. By decomposing the attention computation for long sequences
into chunk-based modules, DCA effectively captures the relative positional
information of tokens within the same chunk (Intra-Chunk) and across distinct
chunks (Inter-Chunk), and integrates seamlessly with Flash Attention. In
addition to its impressive extrapolation capability, DCA achieves
performance on practical long-context tasks that is comparable to or even
better than that of finetuned models. When compared with proprietary models,
our training-free 70B model attains 94% of the performance of gpt-3.5-16k,
indicating it is a viable open-source alternative. All code and data used in
this work are released at https://github.com/HKUNLP/ChunkLlama.
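To make the chunk-based idea in the abstract concrete, below is a minimal, illustrative sketch (not the authors' released implementation) of how decomposing a long sequence into chunks can keep every relative position within a bounded range: queries and keys in the same chunk use their ordinary in-chunk offsets (Intra-Chunk), while queries attending to earlier chunks reuse a capped position (Inter-Chunk). The function and parameter names (`build_relative_positions`, `chunk_size`) are hypothetical and chosen only for this sketch; for the actual method, see the ChunkLlama repository linked above.

```python
# Illustrative sketch of chunk-based relative position indices.
# Assumption: names and the exact inter-chunk capping rule are simplified
# for exposition and do not reproduce the paper's precise scheme.
import numpy as np


def build_relative_positions(seq_len: int, chunk_size: int) -> np.ndarray:
    """Return a (seq_len, seq_len) lower-triangular matrix of relative distances."""
    in_chunk = np.arange(seq_len) % chunk_size    # position of each token inside its chunk
    chunk_id = np.arange(seq_len) // chunk_size   # index of the chunk each token belongs to

    rel = np.full((seq_len, seq_len), -1, dtype=np.int64)  # -1 marks masked (non-causal) entries
    for q in range(seq_len):
        for k in range(q + 1):                    # causal attention: keys up to the query
            if chunk_id[q] == chunk_id[k]:
                # Intra-chunk: ordinary relative distance between query and key.
                rel[q, k] = in_chunk[q] - in_chunk[k]
            else:
                # Inter-chunk: treat the query as if it sat at the last in-chunk
                # position, so the distance never exceeds chunk_size - 1.
                rel[q, k] = (chunk_size - 1) - in_chunk[k]
    return rel


if __name__ == "__main__":
    # With chunk_size=4, no relative distance exceeds 3 even though the
    # sequence has 8 tokens, i.e. distances stay inside the "pretrained" range.
    print(build_relative_positions(seq_len=8, chunk_size=4))
```

Because all resulting distances fall inside the range seen during pretraining, the same rotary position embeddings can be reused without continual training, which is the property the abstract highlights.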