Taming LLMs by Scaling Learning Rates with Gradient Grouping
June 1, 2025
Authors: Siyuan Li, Juanxi Tian, Zedong Wang, Xin Jin, Zicheng Liu, Wentao Zhang, Dan Xu
cs.AI
Abstract
Training large language models (LLMs) poses challenges due to their massive
scale and heterogeneous architectures. While adaptive optimizers like AdamW
help address gradient variations, they still struggle with efficient and
effective parameter-wise learning rate estimation, resulting in training
instability, slow convergence, and poor compatibility with parameter-efficient
fine-tuning (PEFT) techniques. This work introduces Scaling with Gradient
Grouping (SGG), an optimizer wrapper that improves adaptive learning rate
estimation by dynamic grouping and group-specific scaling. SGG first groups
gradient statistics in each layer into clusters and then applies
cluster-specific scaling to calibrate learning rates for each parameter, thus
imposing collective group-wise constraints while maintaining precise
per-parameter adaptation. Experiments on diverse (M)LLM benchmarks show that
SGG integrates seamlessly with existing optimizers and delivers consistent
gains and faster convergence over baselines across various model sizes. Its
stability across varying batch sizes and learning rates establishes SGG as a
robust choice for LLM optimization.
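
To make the grouping-and-scaling idea concrete, the sketch below illustrates one plausible reading of the abstract: per-layer gradient statistics are clustered, and each cluster receives a group-specific scaling factor that modulates the effective per-parameter step of a base optimizer such as AdamW. This is a minimal illustration, not the authors' released implementation; the wrapper name `SGGWrapper`, the cluster count `k`, the simple 1-D k-means routine, and the choice to apply the scaling to the gradient (as a proxy for rescaling the learning rate) are all assumptions made for clarity.

```python
# Minimal sketch of gradient grouping with cluster-specific scaling.
# NOT the paper's official code: names and design details are assumptions.
import torch
from torch.optim import AdamW


def kmeans_1d(x: torch.Tensor, k: int, iters: int = 10) -> torch.Tensor:
    """Cluster a 1-D tensor of per-parameter statistics into k groups;
    returns the cluster index of each element."""
    # Initialize centers on quantiles so they span the value range.
    q = torch.linspace(0, 1, k, device=x.device, dtype=x.dtype)
    centers = torch.quantile(x, q)
    for _ in range(iters):
        # Assign each statistic to its nearest center, then update centers.
        assign = (x.unsqueeze(1) - centers.unsqueeze(0)).abs().argmin(dim=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                centers[j] = x[mask].mean()
    return assign


class SGGWrapper:
    """Hypothetical optimizer wrapper: clusters per-layer gradient statistics
    and applies a cluster-specific scaling before the base optimizer's step."""

    def __init__(self, params, k: int = 3, lr: float = 1e-3):
        self.params = list(params)
        self.k = k
        self.base = AdamW(self.params, lr=lr)  # any base optimizer would do

    @torch.no_grad()
    def step(self):
        for p in self.params:
            if p.grad is None:
                continue
            # Per-parameter statistic within this layer: gradient magnitude.
            stat = p.grad.abs().flatten()
            k = min(self.k, stat.numel())
            assign = kmeans_1d(stat, k)
            # Group-wise constraint: pull each cluster's mean magnitude
            # toward the layer-wide mean via a cluster-specific factor.
            layer_mean = stat.mean().clamp_min(1e-12)
            scale = torch.ones_like(stat)
            for j in range(k):
                mask = assign == j
                if mask.any():
                    scale[mask] = layer_mean / stat[mask].mean().clamp_min(1e-12)
            # Apply the scaling by modulating the gradient fed to the base
            # optimizer (a simple proxy for per-parameter learning-rate scaling).
            p.grad.mul_(scale.view_as(p.grad))
        self.base.step()

    def zero_grad(self):
        self.base.zero_grad()
```

In a training loop this wrapper would be used in place of the base optimizer (`opt = SGGWrapper(model.parameters())`, then `opt.zero_grad()`, backward pass, `opt.step()`), which matches the abstract's description of SGG as an optimizer wrapper rather than a standalone optimizer.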