ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

Abstract

Efficient attention algorithms are critical for mitigating the quadratic cost of attention in long-context workloads. Prior work uses block-scaled quantisation on Blackwell GPUs to move attention computation to 4-bit precision and accelerate inference. However, these techniques cause significant quality degradation in long-context settings. We show that the output impact of quantisation error is highly non-uniform and increases with the importance of each query-key interaction, concentrating functionally relevant error in a small number of attention blocks that contain the most important tokens. We propose ThriftAttention, a low-bit attention variant that delivers near-FP16 long-context quality at FP4 inference efficiency. The approach proceeds in two stages. First, a heuristic rapidly selects a small number of important query-key block pairs to keep in FP16 precision. Second, the selected blocks are computed in FP16 and the remaining blocks in FP4, with both paths merged via online softmax into a single output. We demonstrate across long-context benchmarks and model families that, by computing only 5% of query-key blocks in FP16, ThriftAttention recovers on average 89.1% of the FP4-to-FP16 performance gap. We also show that ThriftAttention's advantage grows with sequence length, mitigating the systematic FP4 quality degradation observed at longer contexts. The code is available at https://github.com/joesharratt1229/ThriftAttention.
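
The two-stage idea can be illustrated with a minimal pure-Python sketch for a single query vector. This is not the released implementation, and several parts are assumptions made only for illustration: `fake_quantize` is a crude integer-rounding stand-in for block-scaled FP4, `select_blocks` uses a hypothetical heuristic (the query's dot product with each block's mean key), and the function names and the `keep_fraction` parameter are invented here. What the sketch does show is how blocks computed at two precisions can be folded into one output with an online-softmax running maximum and denominator.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fake_quantize(vec, levels=7):
    # Assumption: crude stand-in for FP4, one scale per vector and
    # symmetric rounding to a handful of levels. Real FP4 differs.
    scale = max(abs(x) for x in vec) / levels or 1.0
    return [round(x / scale) * scale for x in vec]

def select_blocks(q, key_blocks, keep_fraction):
    # Hypothetical heuristic: rank key blocks by q . mean(keys in block)
    # and keep the top fraction (at least one block) in high precision.
    def block_score(block):
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        return dot(q, mean_key)
    n_keep = max(1, math.ceil(keep_fraction * len(key_blocks)))
    ranked = sorted(range(len(key_blocks)),
                    key=lambda i: block_score(key_blocks[i]), reverse=True)
    return set(ranked[:n_keep])

def mixed_precision_attention(q, key_blocks, value_blocks, keep_fraction=0.05):
    selected = select_blocks(q, key_blocks, keep_fraction)
    scale = 1.0 / math.sqrt(len(q))
    q_low = fake_quantize(q)
    running_max, denom = -math.inf, 0.0
    acc = [0.0] * len(value_blocks[0][0])
    for i, (keys, values) in enumerate(zip(key_blocks, value_blocks)):
        if i in selected:
            # High-precision path: full-precision floats stand in for FP16.
            scores = [dot(q, k) * scale for k in keys]
        else:
            # Low-precision path: quantised operands stand in for FP4.
            scores = [dot(q_low, fake_quantize(k)) * scale for k in keys]
            values = [fake_quantize(v) for v in values]
        # Online softmax: rescale earlier partial sums to the new maximum.
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        denom = denom * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, values))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / denom for a in acc]

def reference_attention(q, keys, values):
    # Plain softmax attention over all keys at once, for comparison.
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]
```

With `keep_fraction=1.0` every block takes the high-precision path, so the merged result should agree with `reference_attention` up to floating-point rounding. Lowering `keep_fraction` moves more blocks onto the quantised path while leaving the merge logic unchanged.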