

SimpleGPT: Improving GPT via A Simple Normalization Strategy

February 1, 2026
Authors: Marco Chen, Xianbiao Qi, Yelin He, Jiaquan Ye, Rong Xiao
cs.AI

Abstract

In this work, we revisit Transformer optimization through the lens of second-order geometry and establish a direct connection between architectural design, activation scale, the Hessian matrix, and the maximum tolerable learning rate. We introduce a simple normalization strategy, termed SimpleNorm, which stabilizes intermediate activation scales by construction. By analyzing the Hessian of the loss with respect to network activations, we theoretically show that SimpleNorm significantly reduces the spectral norm of the Hessian, thereby permitting larger stable learning rates. We validate our theoretical findings through extensive experiments on large GPT models at parameter scales of 1B, 1.4B, 7B, and 8B. Empirically, SimpleGPT, our SimpleNorm-based network, tolerates learning rates 3-10× larger than standard practice, consistently demonstrates strong optimization stability, and achieves substantially better performance than well-established baselines. Specifically, when training 7B-scale models for 60K steps, SimpleGPT achieves a training loss that is 0.08 lower than that of LLaMA2 with QKNorm, reducing the loss from 2.290 to 2.208. Our source code will be released at https://github.com/Ocram7/SimpleGPT.
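
The abstract does not give SimpleNorm's exact formulation; as a rough illustration of the stated idea of stabilizing intermediate activation scales by construction, the sketch below assumes an RMS-style rescaling of activations to unit scale (the class name SimpleNormSketch and the parameter-free design are assumptions for illustration, not the authors' definition).

```python
# Minimal sketch of an activation-scale-stabilizing normalization,
# assuming an RMS-style rescaling along the feature dimension.
# This is NOT the paper's SimpleNorm; the exact formula is not in the abstract.
import torch
import torch.nn as nn


class SimpleNormSketch(nn.Module):
    """Rescale activations to unit root-mean-square over the last dimension."""

    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_dim)
        rms = x.pow(2).mean(dim=-1, keepdim=True).sqrt()
        # Dividing by the RMS keeps the activation scale bounded by construction,
        # which is the property the abstract links to a smaller Hessian spectral
        # norm and hence larger stable learning rates.
        return x / (rms + self.eps)
```

Under this reading, such a layer would be applied to intermediate activations inside each Transformer block so that their scale stays roughly constant regardless of depth or width; where exactly it is placed relative to attention and MLP sublayers is left to the full paper.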