Full Attention Strikes Back: Converting Full Attention to Sparse in Just 100 Training Steps

Abstract

Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, which forces an undesirable trade-off among efficiency, training cost, and accuracy. In this work, we show that full-attention LLMs are already intrinsically sparse and can be transformed into highly sparse models with only minimal adaptation. Our approach is built on three observations: (1) only a small subset of attention heads truly requires full long-context processing; (2) long-range retrieval is governed primarily by a low-dimensional subspace, allowing relevant tokens to be retrieved efficiently with a 16-dimensional indexer; and (3) the useful token budget is strongly query-dependent, making dynamic top-p selection more suitable than fixed top-k sparsification. Based on these insights, we propose RTPurbo, which retains the full KV cache only for retrieval heads and introduces a lightweight token indexer for sparse attention. By exploiting the model's intrinsic sparsity, RTPurbo achieves sparsification with only a few hundred training steps. Experiments on long-context benchmarks and reasoning tasks show that RTPurbo preserves near-lossless accuracy while delivering substantial efficiency gains, including up to a 9.36× prefill speedup at 1M context and about a 2.01× decode speedup. These results suggest that strong sparse inference can be obtained from standard full-attention training without expensive native sparse pretraining.
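Observation (3) contrasts dynamic top-p selection with fixed top-k sparsification. The following is a minimal sketch of that contrast, not the RTPurbo implementation: the function names `top_p_select` and `top_k_select`, the threshold `p=0.9`, and the toy score lists are all assumptions made for illustration. The scores stand in for whatever relevance scores a token indexer would produce for one query.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_select(scores, p=0.9):
    """Keep the smallest set of tokens whose softmax mass reaches p.

    Hypothetical helper: the budget adapts to how concentrated the scores are.
    """
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected, mass = [], 0.0
    for i in order:
        selected.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(selected)


def top_k_select(scores, k):
    """Keep the k highest-scoring tokens regardless of the score distribution."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


# Toy indexer scores for two queries over the same 8 cached tokens (made-up values).
peaked = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]      # one token dominates
flat = [1.0, 0.9, 1.1, 1.0, 0.95, 1.05, 1.0, 1.0]      # relevance is spread out

if __name__ == "__main__":
    print("top-p, peaked:", top_p_select(peaked))
    print("top-p, flat:  ", top_p_select(flat))
    print("top-k, peaked:", top_k_select(peaked, 4))
    print("top-k, flat:  ", top_k_select(flat, 4))
```

In this toy setting, a fixed k either wastes budget on the peaked query or truncates the flat one, whereas the top-p rule shrinks or grows the selected set per query.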