Taming LLMs by Scaling Learning Rates with Gradient Grouping
June 1, 2025
Authors: Siyuan Li, Juanxi Tian, Zedong Wang, Xin Jin, Zicheng Liu, Wentao Zhang, Dan Xu
cs.AI
Abstract
Training large language models (LLMs) poses challenges due to their massive
scale and heterogeneous architectures. While adaptive optimizers like AdamW
help address gradient variations, they still struggle with efficient and
effective parameter-wise learning rate estimation, resulting in training
instability, slow convergence, and poor compatibility with parameter-efficient
fine-tuning (PEFT) techniques. This work introduces Scaling with Gradient
Grouping (SGG), an optimizer wrapper that improves adaptive learning rate
estimation by dynamic grouping and group-specific scaling. SGG first groups
gradient statistics in each layer into clusters and then applies
cluster-specific scaling to calibrate learning rates for each parameter, thus
imposing collective group-wise constraints while maintaining precise
per-parameter adaptation. Experiments on diverse (M)LLM benchmarks show that
SGG integrates seamlessly with existing optimizers and offers consistent gains
and faster convergence over baselines across various model sizes. Its stability
across varying batch sizes and learning rates establishes SGG as a robust
choice for LLM optimization.
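
The abstract describes SGG only at a high level: cluster each layer's gradient statistics, then calibrate per-parameter learning rates with cluster-specific scaling. The sketch below illustrates that idea as a PyTorch optimizer wrapper. The class name SGGSketch, the use of per-parameter gradient magnitude as the grouping statistic, the quantile-based bucketing in place of a learned clustering, the strength hyperparameter, and the choice to rescale gradients before the wrapped optimizer's step are all illustrative assumptions, not the paper's actual algorithm.

```python
# Minimal sketch of the grouping-and-scaling idea behind SGG, assuming PyTorch.
# Everything beyond "cluster per-layer gradient statistics, then apply
# cluster-specific scaling" is an assumption made for illustration.
import torch


class SGGSketch:
    """Wraps a base optimizer and rescales gradients group-wise before each step."""

    def __init__(self, base_optimizer, num_clusters=3, strength=0.5):
        self.base = base_optimizer
        self.num_clusters = num_clusters  # clusters per layer (assumed hyperparameter)
        self.strength = strength          # how strongly to apply the group scaling

    @torch.no_grad()
    def step(self):
        for group in self.base.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                mag = g.abs().flatten().float()
                # Bucket this layer's parameters into clusters by gradient magnitude
                # (a stand-in for the paper's clustering of gradient statistics).
                edges = torch.quantile(
                    mag, torch.linspace(0, 1, self.num_clusters + 1, device=mag.device)
                )
                cluster = torch.bucketize(mag, edges[1:-1])
                layer_med = mag.median().clamp_min(1e-12)
                scale = torch.ones_like(mag)
                for c in range(self.num_clusters):
                    mask = cluster == c
                    if mask.any():
                        c_med = mag[mask].median().clamp_min(1e-12)
                        # Cluster-specific factor: pull each cluster's effective
                        # step size toward the layer-wide median (assumed rule).
                        scale[mask] = (layer_med / c_med) ** self.strength
                # Rescaling gradients is only a rough proxy for calibrating
                # per-parameter learning rates inside an adaptive optimizer.
                p.grad = (g.flatten() * scale.to(g.dtype)).view_as(g)
        self.base.step()

    def zero_grad(self, set_to_none=True):
        self.base.zero_grad(set_to_none=set_to_none)
```

In this sketch the wrapper would replace the base optimizer in a standard training loop, e.g. SGGSketch(torch.optim.AdamW(model.parameters(), lr=1e-4)), with the usual step()/zero_grad() calls left unchanged.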