ZClip: Adaptive Spike Mitigation for LLM Pre-Training
April 3, 2025
Authors: Abhay Kumar, Louis Owen, Nilabhra Roy Chowdhury, Fabian Güra
cs.AI
Abstract
Training large language models (LLMs) presents numerous challenges, including
gradient instability and loss spikes. These phenomena can lead to catastrophic
divergence, requiring costly checkpoint restoration and data batch skipping.
Traditional gradient clipping techniques, such as constant or norm-based
methods, fail to address these issues effectively due to their reliance on
fixed thresholds or heuristics, leading to inefficient learning and requiring
frequent manual intervention. In this work, we propose ZClip, an adaptive
gradient clipping algorithm that dynamically adjusts the clipping threshold
based on statistical properties of gradient norms over time. Unlike prior
reactive strategies, ZClip proactively adapts to training dynamics without
making any prior assumptions on the scale and the temporal evolution of
gradient norms. At its core, it leverages z-score-based anomaly detection to
identify and mitigate large gradient spikes, preventing malignant loss spikes
while not interfering with convergence otherwise. Our code is available at:
https://github.com/bluorion-com/ZClip.
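
To make the core idea concrete, below is a minimal PyTorch-style sketch of z-score-based adaptive gradient-norm clipping as described at a high level in the abstract. The class name, EMA decay, z-score threshold, warmup length, and clipping target are illustrative assumptions, not the exact ZClip algorithm; see the paper and the repository above for the actual method.

```python
import torch


class ZScoreGradClipper:
    """Minimal sketch of z-score-based adaptive gradient clipping.

    Hyperparameters, update rules, and the clipping target are
    illustrative assumptions, not the exact ZClip algorithm.
    """

    def __init__(self, alpha: float = 0.97, z_thresh: float = 2.5, warmup_steps: int = 25):
        self.alpha = alpha              # EMA decay for running norm statistics (assumed value)
        self.z_thresh = z_thresh        # z-score above which a spike is flagged (assumed value)
        self.warmup_steps = warmup_steps
        self.mean = None                # running mean of gradient norms
        self.var = 0.0                  # running variance of gradient norms
        self.step = 0

    @torch.no_grad()
    def __call__(self, model: torch.nn.Module) -> float:
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        if not grads:
            return 0.0

        # Total L2 norm over all gradients, as in torch.nn.utils.clip_grad_norm_.
        total_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2).item()
        self.step += 1

        if self.mean is None:           # first step: initialize statistics
            self.mean = total_norm
            return total_norm

        std = max(self.var ** 0.5, 1e-12)
        z = (total_norm - self.mean) / std

        # After warmup, rescale gradients whose norm is a statistical outlier.
        if self.step > self.warmup_steps and z > self.z_thresh:
            clip_to = self.mean + self.z_thresh * std   # assumed clipping target
            for g in grads:
                g.mul_(clip_to / total_norm)
            total_norm = clip_to                        # update stats with the clipped norm

        # Exponentially weighted Welford-style update of the running mean and variance.
        delta = total_norm - self.mean
        self.mean += (1.0 - self.alpha) * delta
        self.var = self.alpha * (self.var + (1.0 - self.alpha) * delta * delta)
        return total_norm
```

In this sketch the clipper would be called after loss.backward() and before optimizer.step(), analogously to torch.nn.utils.clip_grad_norm_, so that outlier gradient norms are rescaled in place while typical steps pass through unchanged.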