Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

May 16, 2026
Authors: Yanke Zhou, Yiduo Li, Hanlin Tang, Maohua Li, Kan Liu, Lan Tao, Lin Qu, Yuan Yao, Xiaoxing Ma
cs.AI

Abstract

Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an undesirable trade-off among efficiency, training cost, and accuracy. In this work, we show that full-attention LLMs are already intrinsically sparse and can be transformed into highly sparse models with only minimal adaptation. Our approach is built on three observations: (1) only a small subset of attention heads truly requires full long-context processing; (2) long-range retrieval is governed primarily by a low-dimensional subspace, allowing relevant tokens to be retrieved efficiently with a 16-dimensional indexer; and (3) the useful token budget is strongly query-dependent, making dynamic top-p selection more suitable than fixed top-k sparsification. Based on these insights, we propose RTPurbo, which retains the full KV cache only for retrieval heads and introduces a lightweight token indexer for sparse attention. By exploiting the model's intrinsic sparsity, RTPurbo achieves sparsification with only a few hundred training steps. Experiments on long-context benchmarks and reasoning tasks show that RTPurbo preserves near-lossless accuracy while delivering substantial efficiency gains, including up to a 9.36× prefill speedup at 1M context and about a 2.01× decode speedup. These results suggest that strong sparse inference can be obtained from standard full-attention training without expensive native sparse pretraining.
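
The third observation, that a dynamic top-p budget fits better than a fixed top-k, can be illustrated with a small sketch. This is not the authors' implementation: the function names (`index_scores`, `top_p_select`, `top_k_select`), the scaled dot-product scoring, the softmax normalization, and the threshold `p=0.9` are all assumptions made for illustration. The abstract states only that a 16-dimensional indexer retrieves relevant tokens and that selection is top-p.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def index_scores(q_low, keys_low):
    """Score each cached token by a scaled dot product in the low-dimensional
    index space (hypothetical scoring rule; 16 dimensions in the abstract)."""
    scale = 1.0 / math.sqrt(len(q_low))
    return [scale * sum(qi * ki for qi, ki in zip(q_low, k)) for k in keys_low]


def top_p_select(scores, p=0.9):
    """Return the smallest set of token positions whose softmax mass reaches p."""
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected, mass = [], 0.0
    for i in order:
        selected.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(selected)


def top_k_select(scores, k):
    """Return the k highest-scoring token positions, regardless of score shape."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


# A sharply peaked query needs one token; a flat query needs all of them.
peaked = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
flat = [1.0] * 8

peaked_budget = len(top_p_select(peaked, p=0.9))  # 1 token
flat_budget = len(top_p_select(flat, p=0.9))      # 8 tokens
fixed_budget = len(top_k_select(peaked, k=4))     # always 4 tokens
```

In this toy setting, top-p spends a single token on the peaked query and all eight on the flat one, while top-k with `k=4` over-reads in the first case and under-reads in the second. That is the query dependence the abstract points to.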