

Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon

January 7, 2024
Authors: Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou
cs.AI

Abstract

The utilization of long contexts poses a significant challenge for large language models due to their limited context window length. Although the context window can be extended through fine-tuning, doing so incurs a considerable cost at both training and inference time and exerts an unfavorable impact on the LLM's original capabilities. In this work, we propose Activation Beacon, which condenses the LLM's raw activations into more compact forms so that it can perceive a much longer context within a limited context window. Activation Beacon is introduced as a plug-and-play module for the LLM. It fully preserves the LLM's original capability on short contexts while extending its capability to process longer contexts. Besides, it works with short sliding windows to process the long context, which achieves competitive memory and time efficiency in both training and inference. Activation Beacon is learned via an auto-regression task conditioned on a mixture of beacons with diversified condensing ratios. Thanks to this treatment, it can be trained efficiently purely with short-sequence data in just 10K steps, consuming less than 9 hours on a single 8xA800 GPU machine. Experimental studies show that Activation Beacon is able to extend Llama-2-7B's context length by 100 times (from 4K to 400K) while achieving superior results on both long-context generation and understanding tasks. Our model and code will be available at the BGE repository.
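To make the mechanism concrete, below is a minimal, self-contained sketch of the core idea: condensing each window's activations into a small number of compact vectors and carrying them forward as the history for the next window. The mean-pooling stand-in, the function names (`condense`, `effective_context`), and the random tensors are illustrative assumptions; the paper learns the condensed representations with a dedicated beacon module rather than pooling. Only the 4K window size and the ×100 condensing ratio are taken from the abstract.

```python
import torch

def condense(activations: torch.Tensor, ratio: int) -> torch.Tensor:
    """Stand-in for the learned beacon module: mean-pool every `ratio`
    consecutive raw activations into one condensed vector. The actual
    method learns this mapping; pooling here is purely illustrative."""
    n, d = activations.shape
    k = n // ratio
    return activations[: k * ratio].view(k, ratio, d).mean(dim=1)

def effective_context(window: int, ratio: int) -> int:
    """Each processed window leaves window/ratio condensed slots behind,
    so a fixed window can cover roughly ratio * window raw tokens."""
    return window * ratio

# Toy pass over a long input with a short sliding window.
torch.manual_seed(0)
d_model, window, ratio = 64, 4096, 100
history = []                               # condensed activations carried across windows
for _ in range(4):                         # four 4K windows = 16K raw tokens
    acts = torch.randn(window, d_model)    # placeholder for the window's real activations
    history.append(condense(acts, ratio))  # 4096 activations -> 40 condensed vectors
memory = torch.cat(history)                # compact history the next window would attend to
print(memory.shape)                        # torch.Size([160, 64]) instead of (16384, 64)
print(effective_context(4096, 100))        # 409600, i.e. the abstract's 4K -> 400K
```

Under the training scheme described in the abstract, `ratio` would not be fixed but sampled per window from a diversified set of condensing ratios, so a single beacon module generalizes across compression levels; the specific ratios used are not given in the abstract.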