ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

Abstract

Efficient attention algorithms are critical to mitigate the quadratic cost of attention in long-context workloads. Prior work utilises block-scaled quantisation techniques on Blackwell GPUs to move attention computation to 4-bit precision to accelerate inference. However, these techniques result in significant quality degradation in long-context settings. We show that the output impact of quantisation error is highly non-uniform and increases with the importance of each query-key interaction, concentrating functionally relevant error in a small number of attention blocks that contain the most important tokens. We propose ThriftAttention, a low-bit attention variant that delivers near-FP16 long-context quality at FP4 inference efficiency. This approach proceeds in two stages. First, a heuristic rapidly selects a small number of important query-key block pairs for FP16 precision. Second, the selected blocks are computed in FP16 and the remaining blocks in FP4, with both paths merged via online softmax into a single output. We demonstrate across long-context benchmarks and model families that by computing only 5% of query-key blocks in FP16, ThriftAttention recovers on average 89.1% of the FP4-to-FP16 performance gap. We show ThriftAttention's advantage grows with sequence length, mitigating the systematic FP4 quality degradation observed at longer contexts. The code is available at https://github.com/joesharratt1229/ThriftAttention.
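
The second stage described above (two precision paths merged via online softmax) can be illustrated with a minimal sketch for a single query. This is not the released implementation. It makes several assumptions: each key stands in for a whole key block, `fake_fp4` only simulates FP4 (E2M1) rounding with one shared scale per vector, ordinary Python floats stand in for FP16, and the indices of the high-precision keys (`hi_idx`) are supplied by the caller because the abstract does not specify the selection heuristic. All function names are hypothetical.

```python
import math

# Representable magnitudes of the FP4 (E2M1) format.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def fake_fp4(vec):
    """Simulate FP4 rounding of a vector using one shared scale (assumption)."""
    amax = max(abs(x) for x in vec)
    if amax == 0.0:
        return list(vec)
    scale = amax / FP4_GRID[-1]
    return [
        math.copysign(min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g)), x) * scale
        for x in vec
    ]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def partial_attention(q, keys, values):
    """Unnormalised softmax state (max, sum, accumulator) for one query."""
    scores = [dot(q, k) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    acc = [sum(w * v[c] for w, v in zip(weights, values))
           for c in range(len(values[0]))]
    return m, sum(weights), acc

def merge(state_a, state_b):
    """Online-softmax merge: rescale both states to a common running max."""
    (ma, la, acc_a), (mb, lb, acc_b) = state_a, state_b
    m = max(ma, mb)
    ca, cb = math.exp(ma - m), math.exp(mb - m)
    return m, ca * la + cb * lb, [ca * x + cb * y for x, y in zip(acc_a, acc_b)]

def mixed_attention(q, keys, values, hi_idx, quantise=fake_fp4):
    """Keys in hi_idx use full precision; all other keys go through `quantise`."""
    hi = sorted(set(hi_idx))
    lo = [i for i in range(len(keys)) if i not in set(hi)]
    states = []
    if hi:  # high-precision path
        states.append(partial_attention(
            q, [keys[i] for i in hi], [values[i] for i in hi]))
    if lo:  # simulated low-precision path
        states.append(partial_attention(
            quantise(q), [quantise(keys[i]) for i in lo],
            [quantise(values[i]) for i in lo]))
    m, l, acc = states[0]
    for s in states[1:]:
        m, l, acc = merge((m, l, acc), s)
    return [x / l for x in acc]  # normalise once, after merging
```

When `quantise` is replaced by an identity function such as `list`, the merged result matches ordinary softmax attention up to rounding, whichever keys are placed in `hi_idx`. This is the property that allows the two paths to be computed separately and combined into a single output.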