Dynamic Linear Attention

Abstract

The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, which has motivated the adoption of linear attention mechanisms with sub-quadratic cost. To improve representation capacity under long contexts, recent approaches organize memory in a multi-state manner. However, existing multi-state linear attention methods rely on fixed state merging policies that cannot adapt to dynamically varying token importance. As a result, they irreversibly obscure critical tokens and cause severe error accumulation over long sequences. To address this limitation, we propose DLA, a dynamic memory modeling framework for multi-state linear attention. DLA introduces (i) Information-Aware Dynamic State Merging, which adaptively determines state boundaries based on token-level information variation, preserving high-resolution representations around semantic transitions while aggressively summarizing stable regions, and (ii) Capacity-Bounded Memory Modeling, which maintains a fixed-size, chronologically ordered state cache by selectively merging adjacent low-information states, controlling memory growth with minimal information loss. We pre-train DLA on two different linear attention models and evaluate it on 16 datasets across three categories. Experimental results demonstrate the superiority of DLA over current state-of-the-art methods.
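
To make the two components more concrete, the following is a minimal illustrative sketch, not the DLA implementation. The abstract does not specify the information measure, the boundary rule, or the merge operator, so everything here is an assumption: the class name `DynamicStateCache`, the use of cosine distance between consecutive keys as the "information variation" signal, the fixed `threshold`, and merging by summing key-value outer products are all hypothetical choices made only for illustration.

```python
import numpy as np


def info_variation(prev_key, key):
    """Hypothetical information signal: cosine distance between consecutive keys."""
    denom = np.linalg.norm(prev_key) * np.linalg.norm(key) + 1e-8
    return 1.0 - float(prev_key @ key) / denom


class DynamicStateCache:
    """Toy multi-state linear-attention memory (illustrative assumptions only)."""

    def __init__(self, capacity=4, threshold=0.5):
        self.capacity = capacity    # maximum number of states kept
        self.threshold = threshold  # assumed boundary threshold
        self.states = []            # chronologically ordered sums of k v^T
        self.scores = []            # one information score per state
        self.prev_key = None

    def update(self, key, value):
        kv = np.outer(key, value)
        # The first token always opens a state; later tokens are scored
        # against the previous key.
        score = 1.0 if self.prev_key is None else info_variation(self.prev_key, key)
        self.prev_key = key
        if not self.states or score > self.threshold:
            # Large variation: open a new state so the transition stays sharp.
            self.states.append(kv)
            self.scores.append(score)
        else:
            # Stable region: fold the token into the most recent state.
            self.states[-1] = self.states[-1] + kv
            self.scores[-1] = max(self.scores[-1], score)
        if len(self.states) > self.capacity:
            self._merge_lowest_adjacent_pair()

    def _merge_lowest_adjacent_pair(self):
        # Merge the adjacent pair with the lowest combined score, which keeps
        # the cache size bounded and its chronological order intact.
        pair_scores = [self.scores[i] + self.scores[i + 1]
                       for i in range(len(self.states) - 1)]
        i = int(np.argmin(pair_scores))
        self.states[i] = self.states[i] + self.states[i + 1]
        self.scores[i] = max(self.scores[i], self.scores[i + 1])
        del self.states[i + 1]
        del self.scores[i + 1]

    def read(self, query):
        # Uniform read over all states; a real model would presumably weight
        # states differently, which the abstract does not describe.
        return np.sum([query @ s for s in self.states], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cache = DynamicStateCache(capacity=4, threshold=0.5)
    for _ in range(32):
        cache.update(rng.normal(size=8), rng.normal(size=8))
    print(len(cache.states), cache.read(rng.normal(size=8)).shape)
```

Because merging here is a plain sum, the uniform read reduces to ordinary unnormalized linear attention regardless of where the boundaries fall. The sketch therefore shows only the bookkeeping (adaptive boundaries and a bounded, ordered cache), not the quality gains claimed for DLA, which would depend on how states are scored and read in the actual method.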