
Adam-mini: Use Fewer Learning Rates To Gain More

June 24, 2024
Authors: Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun
cs.AI

Abstract

We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., 1/v). We find that ≥ 90% of these learning rates in v could be harmlessly removed if we (1) carefully partition the parameters into blocks following our proposed principle on Hessian structure; (2) assign a single but good learning rate to each parameter block. We further find that, for each of these parameter blocks, there exists a single high-quality learning rate that can outperform Adam, provided that sufficient resources are available to search it out. We then provide one cost-effective way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 125M to 7B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs and CPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on 2× A800-80GB GPUs, which saves 33% wall-clock time for pre-training.
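To illustrate the block-wise learning-rate idea described in the abstract, below is a minimal PyTorch-style sketch, not the authors' released implementation: it keeps Adam's per-parameter first moment but replaces the per-parameter second moment v with a single scalar per parameter block. The class name AdamMiniSketch, its hyperparameter defaults, and the assumption that the caller has already partitioned parameters into blocks (e.g., per attention head, following the paper's Hessian-structure principle) are illustrative choices rather than details taken from the paper.

```python
import torch


class AdamMiniSketch:
    """Minimal sketch of an Adam-mini-style update (illustrative only).

    First moments are stored per parameter as in Adam, but the second
    moment is a single scalar per parameter block, which is where the
    memory saving over AdamW comes from.
    """

    def __init__(self, param_blocks, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        # `param_blocks`: a list of parameter tensors, one tensor per block.
        # How to choose the blocks (the paper's Hessian-structure principle)
        # is assumed to be handled by the caller in this sketch.
        self.params = list(param_blocks)
        self.lr, (self.b1, self.b2), self.eps = lr, betas, eps
        self.m = [torch.zeros_like(p) for p in self.params]               # per-parameter first moment
        self.v = [torch.zeros((), device=p.device) for p in self.params]  # one scalar per block
        self.t = 0

    @torch.no_grad()
    def step(self):
        self.t += 1
        for i, p in enumerate(self.params):
            if p.grad is None:
                continue
            g = p.grad
            # Per-parameter first moment, as in Adam.
            self.m[i].mul_(self.b1).add_(g, alpha=1 - self.b1)
            # Block-wise second moment: one scalar tracking the mean squared
            # gradient over the whole block, instead of one value per entry.
            self.v[i].mul_(self.b2).add_(g.pow(2).mean(), alpha=1 - self.b2)
            m_hat = self.m[i] / (1 - self.b1 ** self.t)
            v_hat = self.v[i] / (1 - self.b2 ** self.t)
            p.add_(m_hat / (v_hat.sqrt() + self.eps), alpha=-self.lr)
```

In this sketch the shared learning-rate resource for a block is simply the exponential moving average of the block's mean squared gradient; the paper's contribution is the principle for choosing blocks and a cost-effective way to set a good per-block value, which the sketch does not reproduce.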
