MSign: An Optimizer Preventing Training Instability in Large Language Models via Stable Rank Restoration

February 2, 2026
Authors: Lianhai Ren, Yucheng Ding, Xiao Liu, Qianxiao Li, Peng Cheng, Yeyun Gong
cs.AI

Abstract

Training instability remains a critical challenge in large language model (LLM) pretraining, often manifesting as sudden gradient explosions that waste significant computational resources. We study training failures in a 5M-parameter NanoGPT model scaled via μP, identifying two key phenomena preceding collapse: (1) rapid decline in weight matrix stable rank (ratio of squared Frobenius norm to squared spectral norm), and (2) increasing alignment between adjacent layer Jacobians. We prove theoretically that these two conditions jointly cause exponential gradient norm growth with network depth. To break this instability mechanism, we propose MSign, a new optimizer that periodically applies matrix sign operations to restore stable rank. Experiments on models from 5M to 3B parameters demonstrate that MSign effectively prevents training failures with a computational overhead of less than 7.0%.
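The abstract defines stable rank as the ratio of the squared Frobenius norm to the squared spectral norm, and describes MSign as periodically applying a matrix sign operation to restore it. The sketch below is an illustrative interpretation, not the paper's implementation: `stable_rank` follows the stated definition, and `msign` is a hypothetical realization of the matrix sign step as the orthogonal polar factor from an SVD, which sets all singular values to 1 and thereby maximizes stable rank.

```python
import numpy as np

def stable_rank(W: np.ndarray) -> float:
    """Stable rank per the abstract: ||W||_F^2 / ||W||_2^2."""
    fro_sq = np.sum(W ** 2)                # squared Frobenius norm
    spec = np.linalg.norm(W, ord=2)        # spectral norm (largest singular value)
    return float(fro_sq / spec ** 2)

def msign(W: np.ndarray) -> np.ndarray:
    """Hypothetical matrix sign step: orthogonal polar factor U @ V^T.

    This maps every singular value to 1, so the result has full
    stable rank; the paper's actual operation may differ.
    """
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

# A well-conditioned matrix has stable rank near its full rank;
# a rank-collapsed matrix (one dominant singular value) has stable rank near 1.
W_healthy = np.eye(4)
W_collapsed = np.outer(np.ones(4), np.ones(4))  # rank-1 matrix

print(stable_rank(W_healthy))            # 4.0
print(stable_rank(W_collapsed))          # 1.0
print(stable_rank(msign(W_collapsed)))   # 4.0 (restored to full stable rank)
```

The example shows the failure signal the authors observe (stable rank collapsing toward 1) and how a sign-like orthogonalization restores it, consistent with the abstract's description of MSign's role.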