
Lizard: An Efficient Linearization Framework for Large Language Models

July 11, 2025
Authors: Chien Van Nguyen, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Viet Dac Lai, Haoliang Wang, Jayakumar Subramanian, Ryan A. Rossi, Trung Bui, Nikos Vlassis, Franck Dernoncourt, Thien Huu Nguyen
cs.AI

Abstract

We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into flexible, subquadratic architectures for infinite-context generation. Transformer-based LLMs face significant memory and computational bottlenecks as context lengths increase, due to the quadratic complexity of softmax attention and the growing key-value (KV) cache. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving the output quality. Unlike previous linearization methods, which are often limited by fixed model structures and therefore exclude gating mechanisms, Lizard incorporates a gating module inspired by recent state-of-the-art linear models. This enables adaptive memory control, supports constant-memory inference, offers strong length generalization, and allows more flexible model design. Lizard combines gated linear attention for global context compression with sliding window attention enhanced by meta memory, forming a hybrid mechanism that captures both long-range dependencies and fine-grained local interactions. Moreover, we introduce a hardware-aware algorithm that accelerates the training speed of our models. Extensive experiments show that Lizard achieves near-lossless recovery of the teacher model's performance across standard language modeling tasks, while significantly outperforming previous linearization methods. On the 5-shot MMLU benchmark, Lizard improves over prior models by 18 points and shows significant improvements on associative recall tasks.
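To make the hybrid mechanism concrete, the sketch below illustrates the general idea the abstract describes: a gated linear-attention branch that compresses global context into a fixed-size recurrent state, combined with a sliding-window softmax branch for fine-grained local interactions. This is a minimal illustration, not the authors' implementation; the gate parameterization, feature map, window size, mixing rule, and all shapes are assumptions made for the example, and the meta-memory and hardware-aware training components are not shown.

```python
# Illustrative sketch of a gated-linear + sliding-window hybrid attention.
# All design choices here (sigmoid gate, elu+1 feature map, additive mix)
# are assumptions for demonstration, not details taken from the Lizard paper.
import torch
import torch.nn.functional as F


def gated_linear_attention(q, k, v, gate):
    """Recurrent form S_t = g_t * S_{t-1} + k_t v_t^T,  o_t = q_t S_t.
    q, k, gate: (T, d); v: (T, d_v). Memory is constant in sequence length T."""
    T, d = q.shape
    d_v = v.shape[-1]
    state = torch.zeros(d, d_v)
    outputs = []
    for t in range(T):
        # Data-dependent gate decays the running key-value memory
        # (the "adaptive memory control" role described in the abstract).
        state = gate[t].unsqueeze(-1) * state + torch.outer(k[t], v[t])
        outputs.append(q[t] @ state)
    return torch.stack(outputs)


def sliding_window_attention(q, k, v, window=64):
    """Causal softmax attention restricted to the most recent `window` tokens."""
    T, d = q.shape
    scores = (q @ k.T) / d ** 0.5
    idx = torch.arange(T)
    mask = (idx[None, :] > idx[:, None]) | (idx[:, None] - idx[None, :] >= window)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


def hybrid_attention(x, wq, wk, wv, wg, window=64):
    """Combines the global (gated linear) and local (windowed softmax) branches."""
    q, k, v = x @ wq, x @ wk, x @ wv
    gate = torch.sigmoid(x @ wg)          # per-token, per-channel forget gate
    feat = lambda z: F.elu(z) + 1         # positive feature map for the linear branch
    global_out = gated_linear_attention(feat(q), feat(k), v, gate)
    local_out = sliding_window_attention(q, k, v, window)
    return global_out + local_out         # simple additive mix (an assumption)


if __name__ == "__main__":
    T, d = 128, 32
    x = torch.randn(T, d)
    wq, wk, wv, wg = (torch.randn(d, d) * d ** -0.5 for _ in range(4))
    print(hybrid_attention(x, wq, wk, wv, wg).shape)  # torch.Size([128, 32])
```

The key property the sketch highlights is that the linear branch carries only a (d, d_v) state forward, so inference memory stays constant as context grows, while the windowed softmax branch preserves exact local attention within the window.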