Dynamic Linear Attention

Abstract

The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, which has motivated the adoption of linear attention mechanisms with sub-quadratic cost. To improve representational capacity over long contexts, recent approaches organize memory into multiple states. However, existing multi-state linear attention methods rely on fixed state-merging policies that cannot adapt to dynamically varying token importance; critical tokens are irreversibly obscured, and errors accumulate severely over long sequences. To address this limitation, we propose DLA, a dynamic memory modeling framework for multi-state linear attention. DLA introduces (i) Information-Aware Dynamic State Merging, which adaptively determines state boundaries from token-level information variation, preserving high-resolution representations around semantic transitions while aggressively summarizing stable regions, and (ii) Capacity-Bounded Memory Modeling, which maintains a fixed-size, chronologically ordered state cache by selectively merging adjacent low-information states, controlling memory growth with minimal information loss. We pre-train DLA on two different linear attention models and evaluate it on 16 datasets spanning three task categories. Experimental results show that DLA outperforms state-of-the-art methods.
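
To make the two mechanisms concrete, the following is a minimal illustrative sketch, not the DLA implementation. It rests on several assumptions that the abstract does not specify: token-level information variation is approximated by the L1 change between consecutive token vectors, each state is a running sum of token vectors (a stand-in for the additive key-value state of unnormalized linear attention, where merging two states is an addition), and the cache merges the adjacent pair with the lowest combined score when it exceeds capacity. The names `State`, `l1_change`, `merge`, `build_cache`, `threshold`, and `capacity` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class State:
    start: int          # index of the first token summarized by this state
    count: int          # number of tokens summarized
    total: List[float]  # running sum of token vectors (stand-in for a KV state)
    info: float         # accumulated information score (assumed measure)


def l1_change(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed proxy for token-level information variation."""
    return sum(abs(x - y) for x, y in zip(a, b))


def merge(a: State, b: State) -> State:
    """Merge two adjacent states; additive, like a sum of outer products."""
    total = [x + y for x, y in zip(a.total, b.total)]
    return State(a.start, a.count + b.count, total, a.info + b.info)


def build_cache(tokens: Sequence[Sequence[float]],
                threshold: float,
                capacity: int) -> List[State]:
    cache: List[State] = []
    prev: Optional[Sequence[float]] = None
    for i, tok in enumerate(tokens):
        change = 0.0 if prev is None else l1_change(tok, prev)
        new = State(i, 1, list(tok), change)
        if cache and change < threshold:
            # Stable region: fold the token into the most recent state.
            cache[-1] = merge(cache[-1], new)
        else:
            # Large change (or first token): open a new state at this boundary.
            cache.append(new)
            if len(cache) > capacity:
                # Over capacity: merge the adjacent pair with the lowest
                # combined score, which keeps the cache in time order.
                j = min(range(len(cache) - 1),
                        key=lambda k: cache[k].info + cache[k + 1].info)
                cache[j:j + 2] = [merge(cache[j], cache[j + 1])]
        prev = tok
    return cache


if __name__ == "__main__":
    toks = [[0.0], [0.1], [5.0], [5.1], [9.0], [9.05], [0.0]]
    for s in build_cache(toks, threshold=1.0, capacity=3):
        print(s.start, s.count)
```

In this toy run the cache never holds more than `capacity` states, every token is accounted for in exactly one state, and states remain ordered by their starting position. A learned or model-derived information score would replace the L1 heuristic in any realistic setting.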