SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

Abstract

The efficiency of attention matters because of its quadratic time complexity in sequence length. We improve the efficiency of attention through two key contributions. First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on the RTX5090, a 5x speedup over the fastest FlashAttention on the RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer the application of low-bit attention to training tasks. Existing low-bit attention works such as FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be applied effectively to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code will be available at https://github.com/thu-ml/SageAttention.
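
As a rough illustration of what "microscaling FP4" means in general, the sketch below quantizes a vector block by block onto the FP4 (E2M1) value grid, with one shared scale per block. It is not the authors' kernel. The function names, the block size of 16, and the choice to keep the scale in full precision are assumptions made for this example; real microscaling formats store the per-block scale in a low-precision format as well.

```python
import random

# Magnitudes representable by the FP4 E2M1 format (sign handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = FP4_GRID[-1]


def quantize_block(block):
    """Quantize one block to signed FP4 grid values with a shared scale.

    Returns (scale, codes). Hypothetical helper, for illustration only.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest FP4 value.
    # Assumption: the scale stays in full precision (a simplification).
    scale = amax / FP4_MAX
    codes = []
    for x in block:
        # Round the scaled magnitude to the nearest representable FP4 value.
        magnitude = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(magnitude if x >= 0 else -magnitude)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from a scale and FP4 grid values."""
    return [scale * c for c in codes]


def microscale_roundtrip(values, block_size=16):
    """Quantize and dequantize a flat list, one block at a time."""
    out = []
    for start in range(0, len(values), block_size):
        scale, codes = quantize_block(values[start:start + block_size])
        out.extend(dequantize_block(scale, codes))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    row = [rng.gauss(0.0, 1.0) for _ in range(64)]
    approx = microscale_roundtrip(row)
    max_err = max(abs(a - b) for a, b in zip(row, approx))
    print(f"max abs round-trip error: {max_err:.4f}")
```

Because the widest gap on the E2M1 grid is between 4 and 6, the round-trip error of any element in this sketch is at most one sixth of the largest magnitude in its block. Small blocks keep that bound tight when a tensor contains outliers.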