
低精度Transformer训练失败之因:基于Flash Attention的分析

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

October 5, 2025
作者: Haiquan Qiu, Quanming Yao
cs.AI

Abstract

The pursuit of computational efficiency has driven the adoption of low-precision formats for training transformer models. However, this progress is often hindered by notorious training instabilities. This paper provides the first mechanistic explanation for a long-standing and unresolved failure case in which training with flash attention in low-precision settings leads to catastrophic loss explosions. Our in-depth analysis reveals that the failure is not a random artifact but is caused by two intertwined phenomena: the emergence of similar low-rank representations within the attention mechanism and the compounding effect of biased rounding errors inherent in low-precision arithmetic. We demonstrate how these factors create a vicious cycle of error accumulation that corrupts weight updates, ultimately derailing the training dynamics. To validate our findings, we introduce a minimal modification to flash attention that mitigates the bias in rounding errors. This simple change stabilizes the training process, confirming our analysis and offering a practical solution to this persistent problem.
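
To make the rounding-bias mechanism concrete, below is a minimal sketch (an assumed illustration, not the paper's code or its proposed fix). It accumulates many small, same-signed contributions in bfloat16, loosely mimicking block-wise sums over similar value vectors inside attention, and compares the result against float32 accumulation. Once each increment falls below half an ulp of the running sum, every rounding error has the same sign, so the error compounds instead of cancelling.

```python
# Minimal sketch (assumed illustration, not the paper's code): biased rounding
# error in a low-precision accumulator versus a higher-precision one.
import torch

# Many small, same-signed increments, loosely mimicking contributions from
# similar (low-rank) value vectors weighted by attention scores.
increments = torch.full((10_000,), 1e-4, dtype=torch.float32)

# Reference: accumulate everything in float32.
ref = 1.0 + increments.sum().item()

# Low precision: keep the running sum in bfloat16 throughout.
acc_bf16 = torch.tensor(1.0, dtype=torch.bfloat16)
for x in increments.to(torch.bfloat16):
    acc_bf16 = acc_bf16 + x  # each addition rounds to the nearest bfloat16

# Contrast: same bfloat16 inputs, but the accumulator is kept in float32.
acc_fp32 = torch.tensor(1.0, dtype=torch.float32)
for x in increments.to(torch.bfloat16):
    acc_fp32 = acc_fp32 + x.float()

print(f"float32 reference:    {ref:.4f}")             # close to 2.0
print(f"bfloat16 accumulator: {acc_bf16.item():.4f}")  # stuck at 1.0
print(f"float32 accumulator:  {acc_fp32.item():.4f}")  # close to 2.0
# Near 1.0 a bfloat16 ulp is 2^-7 ≈ 0.0078, so adding 1e-4 rounds straight back
# to the current accumulator value: every rounding error has the same sign and
# the bias compounds instead of cancelling out.
```

Keeping the accumulator in higher precision is used here purely for contrast as one generic way to remove such bias; the paper's actual modification is a minimal change inside flash attention itself and is not reproduced here.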