Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers
February 7, 2024
Authors: Abhimanyu Rajeshkumar Bambhaniya, Amir Yazdanbakhsh, Suvinay Subramanian, Sheng-Chun Kao, Shivani Agrawal, Utku Evci, Tushar Krishna
cs.AI
Abstract
N:M structured sparsity has garnered significant interest as a result of its
relatively modest overhead and improved efficiency. Additionally, this form of
sparsity holds considerable appeal for reducing the memory footprint owing to
its modest representation overhead. There have been efforts to develop
training recipes for N:M structured sparsity, but they primarily focus on
low-sparsity regions (~50%). Nonetheless, the performance of models trained
using these approaches tends to decline when confronted with high-sparsity
regions (>80%). In this work, we study the effectiveness of existing sparse
training recipes at high-sparsity regions and argue that these methods
fail to sustain the model quality on par with low-sparsity regions. We
demonstrate that the significant factor contributing to this disparity is the
presence of elevated levels of induced noise in the gradient magnitudes. To
mitigate this undesirable effect, we employ decay mechanisms to progressively
restrict the flow of gradients towards pruned elements. Our approach improves
model quality by up to 2% and 5% for vision and language models, respectively,
in the high-sparsity regime. We also evaluate the trade-off between
model accuracy and training compute cost in terms of FLOPs. At iso-training
FLOPs, our method yields better performance than conventional sparse
training recipes, exhibiting an accuracy improvement of up to 2%. The source
code is available at
https://github.com/abhibambhaniya/progressive_gradient_flow_nm_sparsity.
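
The sketch below illustrates the core idea described in the abstract: an N:M magnitude mask whose pruned entries are attenuated by a decaying factor instead of being hard-zeroed, so that gradient flow to pruned weights is progressively restricted over training. This is a minimal sketch assuming a PyTorch-style setup; the helper names (nm_mask, decay_factor, effective_weight) and the linear decay schedule are illustrative choices, not the authors' exact recipe (see the linked repository for that).

# Minimal sketch (illustrative, not the authors' implementation):
# N:M masking with a decaying path for gradients to pruned weights.
import torch


def nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the N largest-magnitude entries in every contiguous block of M
    along the last dimension (2:4 by default). Returns a 0/1 mask."""
    rows, cols = weight.shape
    assert cols % m == 0, "last dimension must be divisible by M"
    with torch.no_grad():
        blocks = weight.abs().reshape(rows, cols // m, m)
        keep = blocks.topk(n, dim=-1).indices                  # survivors per block
        mask = torch.zeros_like(blocks).scatter_(-1, keep, 1.0)
    return mask.reshape(rows, cols)


def decay_factor(step: int, total_steps: int) -> float:
    """Illustrative linear schedule: pruned positions start with full gradient
    flow and are squeezed toward a hard mask as training progresses."""
    return max(0.0, 1.0 - step / total_steps)


def effective_weight(weight: torch.Tensor, step: int, total_steps: int) -> torch.Tensor:
    """Scale pruned positions by the decay factor instead of zeroing them.
    Because the scale multiplies the weight inside the autograd graph,
    gradients reaching pruned elements are attenuated by the same factor."""
    mask = nm_mask(weight)
    scale = mask + (1.0 - mask) * decay_factor(step, total_steps)
    return weight * scale

In a training loop, one would use effective_weight(layer.weight, step, total_steps) in place of the raw weight before the matrix multiply; early in training pruned elements still receive most of their gradient signal, and by the final steps the scale reduces to a standard hard N:M mask, which is the progressive restriction of gradient flow the abstract refers to.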