ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

Abstract

Efficient attention algorithms are critical for mitigating the quadratic cost of attention in long-context workloads. Prior work uses block-scaled quantisation on Blackwell GPUs to move attention computation to 4-bit precision and thereby accelerate inference. However, these techniques cause significant quality degradation in long-context settings. We show that the impact of quantisation error on the output is highly non-uniform and grows with the importance of each query-key interaction, so that functionally relevant error concentrates in a small number of attention blocks containing the most important tokens. We propose ThriftAttention, a low-bit attention variant that delivers near-FP16 long-context quality at FP4 inference efficiency. The approach proceeds in two stages. First, a heuristic rapidly selects a small number of important query-key block pairs for FP16 precision. Second, the selected blocks are computed in FP16 and the remaining blocks in FP4, and both paths are merged via online softmax into a single output. Across long-context benchmarks and model families, we demonstrate that by computing only 5% of query-key blocks in FP16, ThriftAttention recovers on average 89.1% of the FP4-to-FP16 performance gap. We also show that ThriftAttention's advantage grows with sequence length, mitigating the systematic FP4 quality degradation observed at longer contexts. The code is available at https://github.com/joesharratt1229/ThriftAttention.
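The two-stage procedure described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' kernel, and several pieces are assumptions made purely for illustration: the selection heuristic (`select_blocks`, which ranks block pairs by mean-pooled query-key scores) is hypothetical because the abstract does not specify the heuristic; `fake_quantize` is a crude per-row rounding to a handful of levels that only stands in for block-scaled FP4; the "FP16" path simply uses the unquantised inputs; and the attention is single-head, non-causal, with a sequence length divisible by the block size.

```python
import numpy as np

def fake_quantize(x, levels=7):
    """Crude low-bit stand-in (assumption): per-row symmetric rounding, not real FP4."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / levels
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero on all-zero rows
    return np.round(x / scale) * scale

def select_blocks(q, k, block, frac):
    """Hypothetical heuristic: keep the top `frac` of block pairs by pooled q.k score."""
    nq, nk = q.shape[0] // block, k.shape[0] // block
    q_pool = q.reshape(nq, block, -1).mean(axis=1)
    k_pool = k.reshape(nk, block, -1).mean(axis=1)
    scores = q_pool @ k_pool.T
    keep = max(1, int(round(frac * nq * nk)))
    thresh = np.sort(scores, axis=None)[-keep]  # keep-th largest pooled score
    return scores >= thresh  # boolean mask, True = high-precision block

def mixed_precision_attention(q, k, v, block=4, frac=0.05):
    """Blockwise attention; selected blocks use full inputs, the rest use quantised ones."""
    d = q.shape[-1]
    hi = select_blocks(q, k, block, frac)
    k_lo, v_lo = fake_quantize(k), fake_quantize(v)
    out = np.zeros((q.shape[0], v.shape[-1]))
    for i in range(hi.shape[0]):
        qs = slice(i * block, (i + 1) * block)
        q_hi, q_lo = q[qs], fake_quantize(q[qs])
        m = np.full((block, 1), -np.inf)       # running row maximum
        l = np.zeros((block, 1))               # running softmax denominator
        acc = np.zeros((block, v.shape[-1]))   # running unnormalised output
        for j in range(hi.shape[1]):
            ks = slice(j * block, (j + 1) * block)
            if hi[i, j]:   # high-precision path
                s, vj = q_hi @ k[ks].T / np.sqrt(d), v[ks]
            else:          # low-precision path
                s, vj = q_lo @ k_lo[ks].T / np.sqrt(d), v_lo[ks]
            # Online softmax: rescale earlier partial sums to the new row maximum,
            # so both paths accumulate into the same normaliser and output.
            m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
            corr = np.exp(m - m_new)
            p = np.exp(s - m_new)
            l = l * corr + p.sum(axis=-1, keepdims=True)
            acc = acc * corr + p @ vj
            m = m_new
        out[qs] = acc / l
    return out
```

Calling `mixed_precision_attention(q, k, v, frac=1.0)` routes every block through the high-precision path and should therefore match ordinary softmax attention, which makes a convenient sanity check; lowering `frac` toward 0.05 sends most blocks through the quantised path while the online-softmax rescaling keeps the two paths in a single consistent normalisation.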