ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

Abstract

Efficient attention algorithms are critical to mitigate the quadratic cost of attention in long-context workloads. Prior work utilises block-scaled quantisation techniques on Blackwell GPUs to move attention computation to 4-bit precision to accelerate inference. However, these techniques result in significant quality degradation in long-context settings. We show that the output impact of quantisation error is highly non-uniform and increases with the importance of each query-key interaction, concentrating functionally relevant error in a small number of attention blocks that contain the most important tokens. We propose ThriftAttention, a low-bit attention variant that delivers near-FP16 long-context quality at FP4 inference efficiency. This approach proceeds in two stages. First, a heuristic rapidly selects a small number of important query-key block pairs for FP16 precision. Second, the selected blocks are computed in FP16 and the remaining blocks in FP4, with both paths merged via online softmax into a single output. We demonstrate across long-context benchmarks and model families that by computing only 5% of query-key blocks in FP16, ThriftAttention recovers on average 89.1% of the FP4-to-FP16 performance gap. We show ThriftAttention's advantage grows with sequence length, mitigating the systematic FP4 quality degradation observed at longer contexts. The code is available at https://github.com/joesharratt1229/ThriftAttention.
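The second stage described above, merging a high-precision path and a low-precision path via online softmax, can be illustrated with a minimal NumPy sketch. It is not the released ThriftAttention kernel. It assumes a single query block, a per-key boolean selection `hi_mask` in place of the paper's query-key block-pair selection, both key subsets non-empty, and a hypothetical `fake_low_precision` rounding function as a stand-in for real block-scaled FP4. All function names here are illustrative.

```python
import numpy as np

def partial_attention(q, k, v):
    """Attention over a subset of keys.

    Returns the unnormalised output, the per-row score maximum,
    and the per-row sum of exponentials (the online-softmax state).
    """
    s = q @ k.T / np.sqrt(q.shape[-1])      # scores, shape (n_q, n_k)
    m = s.max(axis=-1, keepdims=True)       # row max for numerical stability
    p = np.exp(s - m)
    l = p.sum(axis=-1, keepdims=True)       # row sum of exponentials
    return p @ v, m, l

def merge_partials(o1, m1, l1, o2, m2, l2):
    """Combine two partial softmax results over disjoint key sets."""
    m = np.maximum(m1, m2)                  # shared maximum
    a1, a2 = np.exp(m1 - m), np.exp(m2 - m) # rescale each path to the shared max
    return (a1 * o1 + a2 * o2) / (a1 * l1 + a2 * l2)

def fake_low_precision(x, levels=8):
    """Hypothetical stand-in for FP4: crude per-tensor uniform rounding."""
    scale = max(np.abs(x).max() / levels, 1e-12)
    return np.round(x / scale) * scale

def mixed_precision_attention(q, k, v, hi_mask, quantise=fake_low_precision):
    """Selected keys go through the exact path, the rest through the quantised path."""
    o_hi, m_hi, l_hi = partial_attention(q, k[hi_mask], v[hi_mask])
    o_lo, m_lo, l_lo = partial_attention(
        quantise(q), quantise(k[~hi_mask]), quantise(v[~hi_mask])
    )
    return merge_partials(o_hi, m_hi, l_hi, o_lo, m_lo, l_lo)

# Small demonstration on random data (assumed shapes, not from the paper).
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 16))
k = rng.standard_normal((32, 16))
v = rng.standard_normal((32, 16))
hi_mask = np.zeros(32, dtype=bool)
hi_mask[:8] = True                          # pretend the heuristic picked these keys

o_full, _, l_full = partial_attention(q, k, v)
reference = o_full / l_full                 # ordinary full-precision attention
mixed = mixed_precision_attention(q, k, v, hi_mask)
print("max abs deviation from reference:", np.abs(mixed - reference).max())
```

With the quantiser replaced by the identity function, the merged result matches ordinary softmax attention up to floating-point rounding, which shows that the merge itself adds no approximation. Any deviation comes only from the low-precision path.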