Liger: Linearizing Large Language Models to Gated Recurrent Structures
March 3, 2025
Authors: Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, Yu Cheng
cs.AI
Abstract
Transformers with linear recurrent modeling offer linear-time training and
constant-memory inference. Despite their demonstrated efficiency and
performance, pretraining such non-standard architectures from scratch remains
costly and risky. The linearization of large language models (LLMs) transforms
pretrained standard models into linear recurrent structures, enabling more
efficient deployment. However, current linearization methods typically
introduce additional feature map modules that require extensive fine-tuning and
overlook the gating mechanisms used in state-of-the-art linear recurrent
models. To address these issues, this paper presents Liger, short for
Linearizing LLMs to gated recurrent structures. Liger is a novel approach for
converting pretrained LLMs into gated linear recurrent models without adding
extra parameters. It repurposes the pretrained key matrix weights to construct
diverse gating mechanisms, facilitating the formation of various gated
recurrent structures while avoiding the need to train additional components
from scratch. Using lightweight fine-tuning with Low-Rank Adaptation (LoRA),
Liger restores the performance of the linearized gated recurrent models to
match that of the original LLMs. Additionally, we introduce Liger Attention, an
intra-layer hybrid attention mechanism, which recovers 93% of the performance
of the Transformer-based LLM using only 0.02% of the pre-training tokens during
linearization, achieving competitive results across multiple benchmarks, as
validated on models ranging from 1B to 8B parameters. Code is available at
https://github.com/OpenSparseLLMs/Linearization.
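
As a rough illustration of the idea described in the abstract, the sketch below shows a gated linear recurrent layer whose forget gate is derived from the pretrained key projection rather than from newly trained gate parameters. This is a minimal, assumption-laden example: the class name `GatedLinearRecurrentLayer`, the sigmoid gate over the key features, and the per-token outer-product state update are illustrative choices standing in for the general gated linear attention recurrence, not the paper's exact formulation; the Liger Attention hybrid and LoRA fine-tuning are omitted.

```python
# Minimal sketch (assumptions, not the paper's exact method): a gated linear
# recurrent layer that reuses the pretrained Q/K/V projection weights and
# builds its forget gate from the key projection, so no new gate parameters
# are introduced.
import torch
import torch.nn as nn


class GatedLinearRecurrentLayer(nn.Module):
    def __init__(self, dim, pretrained_wq, pretrained_wk, pretrained_wv):
        super().__init__()
        # Reuse the (dim x dim) Q/K/V weights from the pretrained Transformer.
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.q_proj.weight.data.copy_(pretrained_wq)
        self.k_proj.weight.data.copy_(pretrained_wk)
        self.v_proj.weight.data.copy_(pretrained_wv)

    def forward(self, x):
        # x: (batch, seq_len, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Data-dependent forget gate in (0, 1), derived from the key
        # projection output (illustrative; the paper explores several gates).
        g = torch.sigmoid(k)
        b, t, d = x.shape
        state = x.new_zeros(b, d, d)  # recurrent state S_t
        outputs = []
        for i in range(t):  # linear-time recurrence over the sequence
            # S_t = diag(g_t) * S_{t-1} + k_t^T v_t
            state = g[:, i].unsqueeze(-1) * state + \
                    k[:, i].unsqueeze(-1) * v[:, i].unsqueeze(1)
            # o_t = q_t S_t  (constant-memory inference: only S_t is kept)
            outputs.append(torch.einsum('bd,bde->be', q[:, i], state))
        return torch.stack(outputs, dim=1)


# Hypothetical usage with random tensors standing in for pretrained weights.
dim = 64
wq, wk, wv = (torch.randn(dim, dim) for _ in range(3))
layer = GatedLinearRecurrentLayer(dim, wq, wk, wv)
y = layer(torch.randn(2, 16, dim))  # -> (2, 16, 64)
```

In the approach the abstract describes, LoRA adapters would then be attached for lightweight fine-tuning, and Liger Attention would blend this linear recurrence with standard softmax attention within the same layer; both are left out of the sketch for brevity.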