ZClip：大規模言語モデル事前学習のための適応的スパイク緩和

要旨

大規模言語モデル（LLM）の学習には、勾配の不安定性や損失スパイクなど、数多くの課題が存在します。これらの現象は、致命的な発散を引き起こし、高コストなチェックポイントの復元やデータバッチのスキップを必要とします。従来の勾配クリッピング技術、例えば定数やノルムベースの手法は、固定された閾値やヒューリスティックに依存しているため、これらの問題を効果的に解決できず、非効率な学習や頻繁な手動介入を招きます。本研究では、ZClipという適応型勾配クリッピングアルゴリズムを提案します。ZClipは、時間経過に伴う勾配ノルムの統計的特性に基づいて、クリッピング閾値を動的に調整します。従来の反応型戦略とは異なり、ZClipは勾配ノルムのスケールや時間的進化について事前の仮定を置くことなく、学習動態に積極的に適応します。その核心には、zスコアベースの異常検出を活用し、大きな勾配スパイクを特定して緩和することで、悪性の損失スパイクを防ぎつつ、収束を妨げないようにします。私たちのコードは以下で公開されています：https://github.com/bluorion-com/ZClip。

English

Training large language models (LLMs) presents numerous challenges, including gradient instability and loss spikes. These phenomena can lead to catastrophic divergence, requiring costly checkpoint restoration and data batch skipping. Traditional gradient clipping techniques, such as constant or norm-based methods, fail to address these issues effectively due to their reliance on fixed thresholds or heuristics, leading to inefficient learning and requiring frequent manual intervention. In this work, we propose ZClip, an adaptive gradient clipping algorithm that dynamically adjusts the clipping threshold based on statistical properties of gradient norms over time. Unlike prior reactive strategies, ZClip proactively adapts to training dynamics without making any prior assumptions on the scale and the temporal evolution of gradient norms. At its core, it leverages z-score-based anomaly detection to identify and mitigate large gradient spikes, preventing malignant loss spikes while not interfering with convergence otherwise. Our code is available at: https://github.com/bluorion-com/ZClip.

ZClip：大規模言語モデル事前学習のための適応的スパイク緩和

ZClip: Adaptive Spike Mitigation for LLM Pre-Training

要旨

Support