Adam-mini: Use Fewer Learning Rates To Gain More

Abstract

We propose Adam-mini, an optimizer that matches or outperforms AdamW while using 45% to 50% less memory. Adam-mini saves memory by cutting down the learning rate resources in Adam (i.e., 1/√v). We find that at least 90% of these learning rates in v can be harmlessly removed if we (1) carefully partition the parameters into blocks following our proposed principle on Hessian structure and (2) assign a single but good learning rate to each parameter block. We further find that, for each of these parameter blocks, there exists a single high-quality learning rate that can outperform Adam, provided that sufficient resources are available to search for it. We then provide one cost-effective way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par with or better than AdamW on various language models ranging from 125M to 7B parameters for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overhead among GPUs and CPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on 2× A800-80GB GPUs, which saves 33% of the wall-clock time for pre-training.
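
The block-wise idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how each block's learning rate is computed, so the sketch assumes the shared second moment tracks the mean of the squared gradients within the block. The partition into blocks is supplied by the caller rather than derived from Hessian structure, weight decay is omitted, and the names `init_state` and `adam_mini_step` are hypothetical.

```python
import math


def init_state(blocks):
    """Create optimizer state: per-element m, but only ONE v scalar per block."""
    return {
        "t": 0,
        "m": [[0.0] * len(params) for params in blocks],
        "v": [0.0] * len(blocks),
    }


def adam_mini_step(blocks, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one block-wise Adam-style update in place (illustrative sketch).

    blocks, grads: lists of lists of floats; each inner list is one parameter block.
    """
    state["t"] += 1
    t = state["t"]
    for b, (params, g) in enumerate(zip(blocks, grads)):
        # Assumption: the block shares one second moment, updated with the
        # mean of the squared gradients over the whole block.
        mean_sq = sum(x * x for x in g) / len(g)
        state["v"][b] = beta2 * state["v"][b] + (1 - beta2) * mean_sq
        v_hat = state["v"][b] / (1 - beta2 ** t)  # bias correction
        denom = math.sqrt(v_hat) + eps  # one 1/sqrt(v) scale for the block

        m = state["m"][b]
        for i, gi in enumerate(g):
            # The first moment stays per-parameter, as in Adam.
            m[i] = beta1 * m[i] + (1 - beta1) * gi
            m_hat = m[i] / (1 - beta1 ** t)
            params[i] -= lr * m_hat / denom
    return blocks


# Usage: three parameters split into two blocks (an arbitrary partition).
blocks = [[1.0, 2.0], [3.0]]
grads = [[0.5, -0.5], [2.0]]
state = init_state(blocks)
adam_mini_step(blocks, grads, state)
print(blocks, len(state["v"]))
```

In this toy setup the optimizer keeps two second-moment values for three parameters. With realistic block sizes, almost all of v disappears while m is kept in full, which is consistent with the reported 45% to 50% reduction relative to AdamW's two full-size state tensors.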
