Adam-mini: Use Fewer Learning Rates To Gain More

June 24, 2024
Authors: Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun
cs.AI

Abstract

We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., 1/√v). We find that ≥ 90% of these learning rates in v could be harmlessly removed if we (1) carefully partition the parameters into blocks following our proposed principle on Hessian structure; (2) assign a single but good learning rate to each parameter block. We further find that, for each of these parameter blocks, there exists a single high-quality learning rate that can outperform Adam, provided that sufficient resources are available to search it out. We then provide one cost-effective way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 125M to 7B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs and CPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on 2× A800-80GB GPUs, which saves 33% wall-clock time for pre-training.
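To make the abstract's core idea concrete, below is a minimal sketch of an Adam-mini-style update, assuming the simplest reading of the text: each parameter block keeps a single scalar second moment (the mean squared gradient over the block) in place of Adam's per-coordinate v. The block partition, bias correction details, and the treatment of special layers (e.g., embedding and output layers) in the actual method follow the paper and its released implementation; the class name `AdamMiniSketch` and the `blocks` interface here are illustrative, not the authors' API.

```python
# Hedged sketch: one scalar second moment per parameter block instead of
# a per-coordinate v. Not the official Adam-mini implementation.
import torch


class AdamMiniSketch:
    def __init__(self, blocks, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
        # `blocks`: list of parameter tensors, one per block of the chosen
        # Hessian-inspired partition (hypothetical interface for this sketch).
        self.blocks = list(blocks)
        self.lr, (self.b1, self.b2) = lr, betas
        self.eps, self.wd = eps, weight_decay
        self.m = [torch.zeros_like(p) for p in self.blocks]  # first moment, per coordinate
        # Second moment: a single scalar per block -- this is where the memory saving comes from.
        self.v = [torch.zeros((), dtype=p.dtype, device=p.device) for p in self.blocks]
        self.t = 0

    @torch.no_grad()
    def step(self):
        self.t += 1
        for p, m, v in zip(self.blocks, self.m, self.v):
            if p.grad is None:
                continue
            g = p.grad
            m.mul_(self.b1).add_(g, alpha=1 - self.b1)
            # Track the mean squared gradient of the whole block (one learning
            # rate resource per block) rather than a per-coordinate v.
            v.mul_(self.b2).add_((g * g).mean(), alpha=1 - self.b2)
            m_hat = m / (1 - self.b1 ** self.t)
            v_hat = v / (1 - self.b2 ** self.t)
            if self.wd:
                p.mul_(1 - self.lr * self.wd)  # decoupled weight decay, as in AdamW
            p.add_(m_hat / (v_hat.sqrt() + self.eps), alpha=-self.lr)
```

In this sketch the first-moment buffer m is still stored per coordinate, while the second-moment buffer shrinks from one float per parameter to one float per block, which is consistent with the roughly 45% to 50% memory reduction the abstract reports relative to AdamW's two full-size optimizer states.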
