Taming LLMs by Scaling Learning Rates with Gradient Grouping
June 1, 2025
Authors: Siyuan Li, Juanxi Tian, Zedong Wang, Xin Jin, Zicheng Liu, Wentao Zhang, Dan Xu
cs.AI
Abstract
Training large language models (LLMs) poses challenges due to their massive
scale and heterogeneous architectures. While adaptive optimizers like AdamW
help address gradient variations, they still struggle with efficient and
effective parameter-wise learning rate estimation, resulting in training
instability, slow convergence, and poor compatibility with parameter-efficient
fine-tuning (PEFT) techniques. This work introduces Scaling with Gradient
Grouping (SGG), an optimizer wrapper that improves adaptive learning rate
estimation by dynamic grouping and group-specific scaling. SGG first groups
gradient statistics in each layer into clusters and then applies
cluster-specific scaling to calibrate learning rates for each parameter, thus
imposing collective group-wise constraints while maintaining precise
per-parameter adaptation. Experiments on diverse (M)LLM benchmarks show that
SGG integrates seamlessly with existing optimizers and delivers consistent
gains and faster convergence over baselines across various model sizes. Its
stability across varying batch sizes and learning rates establishes SGG as a
robust choice for LLM optimization.
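
To make the grouping-and-scaling idea concrete, the sketch below illustrates one plausible reading of the abstract: per-layer gradient statistics are clustered, and each cluster receives a group-specific scaling factor that modulates the effective per-parameter step of a base optimizer such as AdamW. This is a minimal illustration, not the authors' released implementation; the wrapper name `SGGWrapper`, the cluster count `k`, the simple 1-D k-means routine, and the choice to apply the scaling to the gradient (as a proxy for rescaling the learning rate) are all assumptions made for clarity.

```python
# Minimal sketch of gradient grouping with cluster-specific scaling.
# NOT the paper's official code: names and design details are assumptions.
import torch
from torch.optim import AdamW


def kmeans_1d(x: torch.Tensor, k: int, iters: int = 10) -> torch.Tensor:
    """Cluster a 1-D tensor of per-parameter statistics into k groups;
    returns the cluster index of each element."""
    # Initialize centers on quantiles so they span the value range.
    q = torch.linspace(0, 1, k, device=x.device, dtype=x.dtype)
    centers = torch.quantile(x, q)
    for _ in range(iters):
        # Assign each statistic to its nearest center, then update centers.
        assign = (x.unsqueeze(1) - centers.unsqueeze(0)).abs().argmin(dim=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                centers[j] = x[mask].mean()
    return assign


class SGGWrapper:
    """Hypothetical optimizer wrapper: clusters per-layer gradient statistics
    and applies a cluster-specific scaling before the base optimizer's step."""

    def __init__(self, params, k: int = 3, lr: float = 1e-3):
        self.params = list(params)
        self.k = k
        self.base = AdamW(self.params, lr=lr)  # any base optimizer would do

    @torch.no_grad()
    def step(self):
        for p in self.params:
            if p.grad is None:
                continue
            # Per-parameter statistic within this layer: gradient magnitude.
            stat = p.grad.abs().flatten()
            k = min(self.k, stat.numel())
            assign = kmeans_1d(stat, k)
            # Group-wise constraint: pull each cluster's mean magnitude
            # toward the layer-wide mean via a cluster-specific factor.
            layer_mean = stat.mean().clamp_min(1e-12)
            scale = torch.ones_like(stat)
            for j in range(k):
                mask = assign == j
                if mask.any():
                    scale[mask] = layer_mean / stat[mask].mean().clamp_min(1e-12)
            # Apply the scaling by modulating the gradient fed to the base
            # optimizer (a simple proxy for per-parameter learning-rate scaling).
            p.grad.mul_(scale.view_as(p.grad))
        self.base.step()

    def zero_grad(self):
        self.base.zero_grad()
```

In a training loop this wrapper would be used in place of the base optimizer (`opt = SGGWrapper(model.parameters())`, then `opt.zero_grad()`, backward pass, `opt.step()`), which matches the abstract's description of SGG as an optimizer wrapper rather than a standalone optimizer.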