Lizard: An Efficient Linearization Framework for Large Language Models

July 11, 2025
作者: Chien Van Nguyen, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Viet Dac Lai, Haoliang Wang, Jayakumar Subramanian, Ryan A. Rossi, Trung Bui, Nikos Vlassis, Franck Dernoncourt, Thien Huu Nguyen
cs.AI

Abstract

We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into flexible, subquadratic architectures for infinite-context generation. Transformer-based LLMs face significant memory and computational bottlenecks as context lengths increase, due to the quadratic complexity of softmax attention and the growing key-value (KV) cache. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving the output quality. Unlike previous linearization methods, which are often limited by fixed model structures and therefore exclude gating mechanisms, Lizard incorporates a gating module inspired by recent state-of-the-art linear models. This enables adaptive memory control, supports constant-memory inference, offers strong length generalization, and allows more flexible model design. Lizard combines gated linear attention for global context compression with sliding window attention enhanced by meta memory, forming a hybrid mechanism that captures both long-range dependencies and fine-grained local interactions. Moreover, we introduce a hardware-aware algorithm that accelerates the training speed of our models. Extensive experiments show that Lizard achieves near-lossless recovery of the teacher model's performance across standard language modeling tasks, while significantly outperforming previous linearization methods. On the 5-shot MMLU benchmark, Lizard improves over prior models by 18 points and shows significant improvements on associative recall tasks.
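To make the hybrid design concrete, the sketch below illustrates, for a single head, how a gated linear attention recurrence (constant-size state compressing the full context) can be paired with sliding-window softmax attention over recent tokens. This is only an illustrative approximation of the ideas named in the abstract, not Lizard's actual formulation: the sigmoid gating form, the window size, the additive mixing of the two branches, and the omission of the meta-memory component are all assumptions made for clarity.

```python
# Minimal single-head sketch of a gated-linear + sliding-window hybrid.
# Illustrative only: gating form, window size, and the additive mix are
# assumptions, and the paper's meta-memory mechanism is omitted.

import torch

torch.manual_seed(0)

d_k, d_v, window = 16, 16, 8           # head dims and local window size (assumed)
T = 32                                 # number of decoding steps

# Random per-token projections stand in for learned q/k/v/gate projections.
q = torch.randn(T, d_k)
k = torch.randn(T, d_k)
v = torch.randn(T, d_v)
gate_logits = torch.randn(T, d_k)      # data-dependent forget gate (assumed form)

S = torch.zeros(d_k, d_v)              # constant-size recurrent state (global memory)
recent_k, recent_v = [], []            # bounded sliding-window KV cache
outputs = []

for t in range(T):
    # Gated linear attention branch: compress all past context into S.
    g = torch.sigmoid(gate_logits[t])          # elementwise decay in (0, 1)
    S = g.unsqueeze(-1) * S + torch.outer(k[t], v[t])
    y_linear = S.T @ q[t]                      # read the compressed global memory

    # Sliding-window softmax attention branch: exact local interactions.
    recent_k.append(k[t]); recent_v.append(v[t])
    if len(recent_k) > window:                 # keep only the last `window` tokens
        recent_k.pop(0); recent_v.pop(0)
    K, V = torch.stack(recent_k), torch.stack(recent_v)
    attn = torch.softmax(K @ q[t] / d_k ** 0.5, dim=0)
    y_local = attn @ V

    # Hybrid output: a simple sum; the real mixing rule is not specified here.
    outputs.append(y_linear + y_local)

y = torch.stack(outputs)               # (T, d_v); memory use stays constant in T
print(y.shape)
```

Because the recurrent state S and the local KV cache are both fixed-size, per-token memory does not grow with sequence length, which is the constant-memory inference property the abstract describes; the gate lets the model adaptively forget or retain compressed context.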