
Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers

February 7, 2024
Authors: Abhimanyu Rajeshkumar Bambhaniya, Amir Yazdanbakhsh, Suvinay Subramanian, Sheng-Chun Kao, Shivani Agrawal, Utku Evci, Tushar Krishna
cs.AI

Abstract

N:M structured sparsity has garnered significant interest as a result of its relatively modest overhead and improved efficiency. Additionally, this form of sparsity holds considerable appeal for reducing the memory footprint owing to its modest representation overhead. While there have been efforts to develop training recipes for N:M structured sparsity, they primarily focus on low-sparsity regions (~50%). Nonetheless, the performance of models trained using these approaches tends to decline when confronted with high-sparsity regions (>80%). In this work, we study the effectiveness of existing sparse training recipes at high-sparsity regions and argue that these methods fail to sustain model quality on par with low-sparsity regions. We demonstrate that a significant factor contributing to this disparity is the presence of elevated levels of induced noise in the gradient magnitudes. To mitigate this undesirable effect, we employ decay mechanisms to progressively restrict the flow of gradients towards pruned elements. Our approach improves model quality by up to 2% and 5% for vision and language models in the high-sparsity regime, respectively. We also evaluate the trade-off between model accuracy and training compute cost in terms of FLOPs. At iso-training FLOPs, our method yields better performance compared to conventional sparse training recipes, exhibiting an accuracy improvement of up to 2%. The source code is available at https://github.com/abhibambhaniya/progressive_gradient_flow_nm_sparsity.
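The decay mechanism described above can be pictured with a small sketch. The snippet below is a minimal, illustrative PyTorch example under simplifying assumptions; the names `nm_mask`, `DecayedSparseLinear`, and the single scalar `decay` schedule are hypothetical and invented here for exposition, not the authors' released code. The idea it illustrates: pruned weights keep contributing to the forward pass with a factor that is annealed toward zero, so the gradient flowing to them is progressively restricted rather than cut off abruptly.

```python
import torch
import torch.nn.functional as F


def nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude entries in every contiguous group of m
    along the last dimension (assumes the last dim is divisible by m)."""
    groups = weight.reshape(-1, m)
    topk_idx = groups.abs().topk(n, dim=-1).indices
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, topk_idx, 1.0)
    return mask.reshape_as(weight)


class DecayedSparseLinear(torch.nn.Linear):
    """Hypothetical linear layer trained toward an N:M mask, where pruned
    weights still contribute (and hence receive gradients) scaled by a
    factor `decay` that is annealed from 1 to 0 over training."""

    def __init__(self, in_features: int, out_features: int,
                 n: int = 2, m: int = 4, bias: bool = True):
        super().__init__(in_features, out_features, bias=bias)
        self.n, self.m = n, m
        # Start fully dense (decay = 1); the training loop shrinks this to 0.
        self.register_buffer("decay", torch.tensor(1.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = nm_mask(self.weight.detach(), self.n, self.m)
        # Pruned positions are attenuated by `decay` instead of zeroed out,
        # so their gradients shrink gradually rather than being cut abruptly.
        effective_w = mask * self.weight + self.decay * (1.0 - mask) * self.weight
        return F.linear(x, effective_w, self.bias)


# Toy usage: set the decay factor, then check gradients still reach pruned weights.
layer = DecayedSparseLinear(8, 4, n=2, m=4)
layer.decay.fill_(0.3)                 # e.g. partway through the decay schedule
out = layer(torch.randn(16, 8)).sum()
out.backward()                         # pruned weights receive ~0.3x gradients
```

In practice one would decrease `layer.decay` on a schedule (for example, linearly or exponentially over training steps) and switch to a hard N:M mask once it reaches zero; the linked repository contains the authors' actual training recipes.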