Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon
January 7, 2024
Authors: Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou
cs.AI
Abstract
The utilization of long contexts poses a significant challenge for large language
models due to their limited context window length. Although the context window
can be extended through fine-tuning, doing so incurs a considerable cost at
both training and inference time and exerts an unfavorable impact on the LLM's
original capabilities. In this work, we propose Activation Beacon, which
condenses LLM's raw activations into more compact forms such that it can
perceive a much longer context with a limited context window. Activation Beacon
is introduced as a plug-and-play module for the LLM. It fully preserves the
LLM's original capability on short contexts while adding a new capability for
processing longer contexts. Besides, it works with short sliding windows to
process the long context, which achieves a competitive memory and time
efficiency in both training and inference. Activation Beacon is learned via an
auto-regression task, conditioned on a mixture of beacons with diversified
condensing ratios. Thanks to this treatment, it can be efficiently trained
purely with short-sequence data in just 10K steps, which consumes less than 9
hours on a single 8xA800 GPU machine. The experimental studies show that
Activation Beacon is able to extend Llama-2-7B's context length by 100 times
(from 4K to 400K) while achieving superior results on both long-context
generation and understanding tasks. Our model and code will be
available at the BGE repository.
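The core idea above can be sketched in a few lines: a short window slides over a long sequence, and each window's cached "activations" are condensed by some ratio before moving on, so the retained memory grows far more slowly than the input. This is a minimal toy illustration, not the paper's implementation; the window size, the condensing ratio, and mean-pooling as the condensing operation are all assumptions made for clarity.

```python
# Toy sketch of sliding-window condensing (assumed mechanics, not the
# paper's code). "States" stand in for per-token activations; the real
# method condenses transformer key-value activations with learned beacons.

def condense(window_states, ratio):
    """Mean-pool consecutive groups of `ratio` states into one beacon each."""
    beacons = []
    for i in range(0, len(window_states), ratio):
        group = window_states[i:i + ratio]
        beacons.append(sum(group) / len(group))
    return beacons

def process_long_sequence(states, window=4, ratio=4):
    """Slide a short window over the sequence, keeping only condensed beacons.

    At each step the model would attend to `memory + window_states`,
    so the effective context stays bounded by window + len(memory).
    """
    memory = []
    for start in range(0, len(states), window):
        window_states = states[start:start + window]
        memory.extend(condense(window_states, ratio))
    return memory

# A 16-step toy sequence condensed at ratio 4 leaves 4 beacons:
seq = list(range(16))
beacons = process_long_sequence(seq, window=4, ratio=4)
print(len(seq), "->", len(beacons))  # 16 -> 4
```

With a condensing ratio of 4, a 4K-token window can in principle stand in for a 16K-token history; mixing diverse ratios during training, as the abstract describes, lets one model trade context length against fidelity at inference time.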