ZClip: Adaptive Spike Mitigation for LLM Pre-Training
April 3, 2025
Authors: Abhay Kumar, Louis Owen, Nilabhra Roy Chowdhury, Fabian Güra
cs.AI
Abstract
Training large language models (LLMs) presents numerous challenges, including
gradient instability and loss spikes. These phenomena can lead to catastrophic
divergence, requiring costly checkpoint restoration and data batch skipping.
Traditional gradient clipping techniques, such as constant or norm-based
methods, fail to address these issues effectively due to their reliance on
fixed thresholds or heuristics, leading to inefficient learning and requiring
frequent manual intervention. In this work, we propose ZClip, an adaptive
gradient clipping algorithm that dynamically adjusts the clipping threshold
based on statistical properties of gradient norms over time. Unlike prior
reactive strategies, ZClip proactively adapts to training dynamics without
making any prior assumptions on the scale and the temporal evolution of
gradient norms. At its core, it leverages z-score-based anomaly detection to
identify and mitigate large gradient spikes, preventing malignant loss spikes
while not interfering with convergence otherwise. Our code is available at:
https://github.com/bluorion-com/ZClip.
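
To make the core idea concrete, below is a minimal PyTorch-style sketch of z-score-based adaptive gradient-norm clipping as described at a high level in the abstract. The class name, EMA decay, z-score threshold, warmup length, and clipping target are illustrative assumptions, not the exact ZClip algorithm; see the paper and the repository above for the actual method.

```python
import torch


class ZScoreGradClipper:
    """Minimal sketch of z-score-based adaptive gradient clipping.

    Hyperparameters, update rules, and the clipping target are
    illustrative assumptions, not the exact ZClip algorithm.
    """

    def __init__(self, alpha: float = 0.97, z_thresh: float = 2.5, warmup_steps: int = 25):
        self.alpha = alpha              # EMA decay for running norm statistics (assumed value)
        self.z_thresh = z_thresh        # z-score above which a spike is flagged (assumed value)
        self.warmup_steps = warmup_steps
        self.mean = None                # running mean of gradient norms
        self.var = 0.0                  # running variance of gradient norms
        self.step = 0

    @torch.no_grad()
    def __call__(self, model: torch.nn.Module) -> float:
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        if not grads:
            return 0.0

        # Total L2 norm over all gradients, as in torch.nn.utils.clip_grad_norm_.
        total_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2).item()
        self.step += 1

        if self.mean is None:           # first step: initialize statistics
            self.mean = total_norm
            return total_norm

        std = max(self.var ** 0.5, 1e-12)
        z = (total_norm - self.mean) / std

        # After warmup, rescale gradients whose norm is a statistical outlier.
        if self.step > self.warmup_steps and z > self.z_thresh:
            clip_to = self.mean + self.z_thresh * std   # assumed clipping target
            for g in grads:
                g.mul_(clip_to / total_norm)
            total_norm = clip_to                        # update stats with the clipped norm

        # Exponentially weighted Welford-style update of the running mean and variance.
        delta = total_norm - self.mean
        self.mean += (1.0 - self.alpha) * delta
        self.var = self.alpha * (self.var + (1.0 - self.alpha) * delta * delta)
        return total_norm
```

In this sketch the clipper would be called after loss.backward() and before optimizer.step(), analogously to torch.nn.utils.clip_grad_norm_, so that outlier gradient norms are rescaled in place while typical steps pass through unchanged.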