ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

Abstract

Efficient attention algorithms are critical for mitigating the quadratic cost of attention in long-context workloads. Prior work uses block-scaled quantisation on Blackwell GPUs to run attention in 4-bit precision and accelerate inference. However, these techniques cause significant quality degradation in long-context settings. We show that the output impact of quantisation error is highly non-uniform and increases with the importance of each query-key interaction, so that functionally relevant error concentrates in the small number of attention blocks that contain the most important tokens. We propose ThriftAttention, a low-bit attention variant that delivers near-FP16 long-context quality at FP4 inference efficiency. The approach proceeds in two stages. First, a heuristic rapidly selects a small number of important query-key block pairs to be computed in FP16. Second, the selected blocks are computed in FP16 and the remaining blocks in FP4, and both paths are merged into a single output via online softmax. Across long-context benchmarks and model families, we demonstrate that by computing only 5% of query-key blocks in FP16, ThriftAttention recovers on average 89.1% of the FP4-to-FP16 performance gap. We also show that ThriftAttention's advantage grows with sequence length, mitigating the systematic FP4 quality degradation observed at longer contexts. Code is available at https://github.com/joesharratt1229/ThriftAttention.
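The two-stage procedure can be illustrated with a minimal single-query sketch in plain Python. It is not the released implementation and makes several assumptions: `fake_quantize` is a hypothetical stand-in that snaps values to a coarse uniform grid rather than the real block-scaled FP4 format, the selection heuristic (ranking key blocks by the query's dot product with the block-mean key) is invented for illustration because the abstract does not specify one, and only keys and values are quantised on the low-precision path. What the sketch does show faithfully is the online-softmax merge: each block contributes a running maximum, a normaliser, and an unnormalised weighted sum, and these combine exactly regardless of which precision produced them.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fake_quantize(vec, levels=7):
    """Hypothetical stand-in for low-bit storage: snap a vector to a uniform
    signed grid with one shared scale. This is NOT the real FP4 format."""
    peak = max(abs(x) for x in vec)
    if peak == 0.0:
        return list(vec)
    scale = peak / levels
    return [round(x / scale) * scale for x in vec]


def partial_attention(q, keys, values):
    """Softmax statistics for one query over one block of keys.
    Returns (m, l, acc): max score, sum of exp(score - m), and the
    unnormalised weighted sum of values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    l = sum(weights)
    acc = [sum(w * v[d] for w, v in zip(weights, values))
           for d in range(len(values[0]))]
    return m, l, acc


def merge(a, b):
    """Online-softmax merge of two partial states onto a shared maximum."""
    m_a, l_a, acc_a = a
    m_b, l_b, acc_b = b
    m = max(m_a, m_b)
    s_a, s_b = math.exp(m_a - m), math.exp(m_b - m)
    return m, s_a * l_a + s_b * l_b, [s_a * x + s_b * y
                                      for x, y in zip(acc_a, acc_b)]


def finalize(state):
    _, l, acc = state
    return [x / l for x in acc]


def mixed_precision_attention(q, keys, values, block_size, num_hi_blocks):
    """Attend with the top-ranked key blocks in full precision and the
    remaining blocks through the fake low-precision path."""
    blocks = [range(s, min(s + block_size, len(keys)))
              for s in range(0, len(keys), block_size)]

    # Assumed heuristic (not from the paper): score each block by q . mean(K).
    def block_score(idx):
        mean_key = [sum(keys[i][d] for i in idx) / len(idx)
                    for d in range(len(q))]
        return dot(q, mean_key)

    ranked = sorted(range(len(blocks)),
                    key=lambda b: block_score(blocks[b]), reverse=True)
    hi = set(ranked[:num_hi_blocks])

    state = None
    for b, idx in enumerate(blocks):
        k_blk = [keys[i] for i in idx]
        v_blk = [values[i] for i in idx]
        if b not in hi:  # low-precision path
            k_blk = [fake_quantize(k) for k in k_blk]
            v_blk = [fake_quantize(v) for v in v_blk]
        part = partial_attention(q, k_blk, v_blk)
        state = part if state is None else merge(state, part)
    return finalize(state)
```

Setting `num_hi_blocks` to the total number of blocks reproduces exact attention, which is a convenient sanity check on the merge; lowering it moves more blocks onto the quantised path. A real kernel would process many queries per tile and run the low-precision path on hardware FP4 units, which this sketch does not attempt to model.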