ZClip: Adaptive Spike Mitigation for LLM Pre-Training

Abstract

Training large language models (LLMs) presents numerous challenges, including gradient instability and loss spikes. These phenomena can lead to catastrophic divergence, which requires costly checkpoint restoration and skipping of data batches. Traditional gradient clipping techniques, such as constant or norm-based methods, rely on fixed thresholds or heuristics and therefore fail to address these issues effectively, leading to inefficient learning and frequent manual intervention. In this work, we propose ZClip, an adaptive gradient clipping algorithm that dynamically adjusts the clipping threshold based on statistical properties of gradient norms over time. Unlike prior reactive strategies, ZClip proactively adapts to training dynamics without making any prior assumptions about the scale or temporal evolution of gradient norms. At its core, it uses z-score-based anomaly detection to identify and mitigate large gradient spikes, preventing malignant loss spikes without otherwise interfering with convergence. Our code is available at https://github.com/bluorion-com/ZClip.
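To make the idea of z-score-based clipping concrete, the following is a minimal sketch and not the reference implementation from the repository. It assumes that the running mean and variance of the gradient norm are tracked with exponential moving averages, that a short warm-up initializes them, and that a spiking norm is pulled back to `mean + z_thresh * std`. The class name `ZScoreClipper` and the parameters `alpha`, `z_thresh`, and `warmup_steps` are hypothetical and chosen only for illustration.

```python
import math


class ZScoreClipper:
    """Illustrative z-score gradient-norm clipper (an assumption-based sketch)."""

    def __init__(self, alpha=0.97, z_thresh=2.5, warmup_steps=5, eps=1e-6):
        self.alpha = alpha                # EMA smoothing factor (assumed)
        self.z_thresh = z_thresh          # z-score above which a norm counts as a spike
        self.warmup_steps = warmup_steps  # steps used to initialize the statistics
        self.eps = eps                    # guards against a zero standard deviation
        self.warmup = []
        self.mean = None
        self.var = None

    def _update(self, norm):
        # EMA updates; the deviation is measured against the previous mean.
        delta = norm - self.mean
        self.mean = self.alpha * self.mean + (1 - self.alpha) * norm
        self.var = self.alpha * self.var + (1 - self.alpha) * delta ** 2

    def scale(self, grad_norm):
        """Return the factor in (0, 1] by which the gradients should be multiplied."""
        if self.mean is None:
            # Warm-up: collect norms without clipping, then initialize mean/variance.
            self.warmup.append(grad_norm)
            if len(self.warmup) >= self.warmup_steps:
                n = len(self.warmup)
                self.mean = sum(self.warmup) / n
                self.var = sum((x - self.mean) ** 2 for x in self.warmup) / n
            return 1.0

        std = math.sqrt(self.var) + self.eps
        z = (grad_norm - self.mean) / std
        if z > self.z_thresh:
            # Spike: clip the norm to the threshold and update statistics
            # with the clipped value so the spike does not inflate them.
            clipped = self.mean + self.z_thresh * std
            self._update(clipped)
            return clipped / grad_norm

        # Normal step: leave the gradients untouched.
        self._update(grad_norm)
        return 1.0


clipper = ZScoreClipper()
for norm in [1.0, 1.1, 0.9, 1.0, 1.0, 1.02, 10.0]:
    factor = clipper.scale(norm)
    print(f"norm={norm:.2f} scale={factor:.3f}")
```

In a training loop, the returned factor would be multiplied into every gradient before the optimizer step. Because the threshold follows the running statistics, no fixed clipping constant has to be chosen in advance, which is the property the abstract emphasizes.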
