Taming LLMs by Scaling Learning Rates with Gradient Grouping
June 1, 2025
Authors: Siyuan Li, Juanxi Tian, Zedong Wang, Xin Jin, Zicheng Liu, Wentao Zhang, Dan Xu
cs.AI
Abstract
Training large language models (LLMs) poses challenges due to their massive
scale and heterogeneous architectures. While adaptive optimizers like AdamW
help address gradient variations, they still struggle with efficient and
effective parameter-wise learning rate estimation, resulting in training
instability, slow convergence, and poor compatibility with parameter-efficient
fine-tuning (PEFT) techniques. This work introduces Scaling with Gradient
Grouping (SGG), an optimizer wrapper that improves adaptive learning rate
estimation by dynamic grouping and group-specific scaling. SGG first groups
gradient statistics in each layer into clusters and then applies
cluster-specific scaling to calibrate learning rates for each parameter, thus
imposing collective group-wise constraints while maintaining precise
per-parameter adaptation. Experiments on diverse (M)LLM benchmarks show that
SGG integrates seamlessly with existing optimizers and offers consistent gains
and faster convergence over baselines across various model sizes. Its stability
across varying batch sizes and learning rates establishes SGG as a robust
choice for LLM optimization.
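
The abstract describes SGG only at a high level: cluster each layer's gradient statistics, then calibrate per-parameter learning rates with cluster-specific scaling. The sketch below illustrates that idea as a PyTorch optimizer wrapper. The class name SGGSketch, the use of per-parameter gradient magnitude as the grouping statistic, the quantile-based bucketing in place of a learned clustering, the strength hyperparameter, and the choice to rescale gradients before the wrapped optimizer's step are all illustrative assumptions, not the paper's actual algorithm.

```python
# Minimal sketch of the grouping-and-scaling idea behind SGG, assuming PyTorch.
# Everything beyond "cluster per-layer gradient statistics, then apply
# cluster-specific scaling" is an assumption made for illustration.
import torch


class SGGSketch:
    """Wraps a base optimizer and rescales gradients group-wise before each step."""

    def __init__(self, base_optimizer, num_clusters=3, strength=0.5):
        self.base = base_optimizer
        self.num_clusters = num_clusters  # clusters per layer (assumed hyperparameter)
        self.strength = strength          # how strongly to apply the group scaling

    @torch.no_grad()
    def step(self):
        for group in self.base.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                mag = g.abs().flatten().float()
                # Bucket this layer's parameters into clusters by gradient magnitude
                # (a stand-in for the paper's clustering of gradient statistics).
                edges = torch.quantile(
                    mag, torch.linspace(0, 1, self.num_clusters + 1, device=mag.device)
                )
                cluster = torch.bucketize(mag, edges[1:-1])
                layer_med = mag.median().clamp_min(1e-12)
                scale = torch.ones_like(mag)
                for c in range(self.num_clusters):
                    mask = cluster == c
                    if mask.any():
                        c_med = mag[mask].median().clamp_min(1e-12)
                        # Cluster-specific factor: pull each cluster's effective
                        # step size toward the layer-wide median (assumed rule).
                        scale[mask] = (layer_med / c_med) ** self.strength
                # Rescaling gradients is only a rough proxy for calibrating
                # per-parameter learning rates inside an adaptive optimizer.
                p.grad = (g.flatten() * scale.to(g.dtype)).view_as(g)
        self.base.step()

    def zero_grad(self, set_to_none=True):
        self.base.zero_grad(set_to_none=set_to_none)
```

In this sketch the wrapper would replace the base optimizer in a standard training loop, e.g. SGGSketch(torch.optim.AdamW(model.parameters(), lr=1e-4)), with the usual step()/zero_grad() calls left unchanged.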