Full Attention Strikes Back: Converting Full Attention into Sparse Attention Within Hundreds of Training Steps

Abstract

Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an undesirable trade-off among efficiency, training cost, and accuracy. In this work, we show that full-attention LLMs are already intrinsically sparse and can be transformed into highly sparse models with only minimal adaptation. Our approach is built on three observations: (1) only a small subset of attention heads truly requires full long-context processing; (2) long-range retrieval is governed primarily by a low-dimensional subspace, allowing relevant tokens to be retrieved efficiently with a 16-dimensional indexer; and (3) the useful token budget is strongly query-dependent, making dynamic top-p selection more suitable than fixed top-k sparsification. Based on these insights, we propose RTPurbo, which retains the full KV cache only for retrieval heads and introduces a lightweight token indexer for sparse attention. By exploiting the model's intrinsic sparsity, RTPurbo achieves sparsification with only a few hundred training steps. Experiments on long-context benchmarks and reasoning tasks show that RTPurbo preserves near-lossless accuracy while delivering substantial efficiency gains, including up to a 9.36× prefill speedup at 1M context and about a 2.01× decode speedup. These results suggest that strong sparse inference can be obtained from standard full-attention training without expensive native sparse pretraining.
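To illustrate why a query-dependent budget differs from a fixed top-k budget, the following is a minimal sketch of top-p token selection over scores from a low-dimensional indexer. It is not the RTPurbo implementation: the function names `indexer_scores` and `top_p_select`, the use of a softmax over dot-product scores, and the threshold `p = 0.9` are all assumptions made for illustration.

```python
import math


def indexer_scores(query, keys):
    """Dot-product score between one low-dimensional query and each key.

    Hypothetical helper: in the abstract the indexer is 16-dimensional;
    any dimension works here as long as query and keys agree.
    """
    return [sum(q * k for q, k in zip(query, key)) for key in keys]


def top_p_select(scores, p=0.9):
    """Return indices of the smallest token set whose softmax mass reaches p.

    Assumption: token relevance is the softmax of the indexer scores.
    """
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Visit tokens from most to least probable, stopping once mass >= p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected, mass = [], 0.0
    for i in order:
        selected.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(selected)


# A sharply peaked query needs one token; a flat query needs all of them.
peaked = top_p_select([10.0, 0.0, 0.0, 0.0], p=0.9)
flat = top_p_select([0.0, 0.0, 0.0, 0.0], p=0.9)
print(len(peaked), len(flat))
```

In this sketch the selected set shrinks when the score distribution is concentrated on a few tokens and grows when it is spread out, whereas a fixed top-k rule would keep the same number of tokens in both cases.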