
Fine-Tuning Language Models with Just Forward Passes

May 27, 2023
Authors: Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, Sanjeev Arora
cs.AI

Abstract

Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
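The core idea described above (estimating a gradient from two forward passes and applying the update in place, regenerating the random perturbation from a saved seed instead of storing it) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: `mezo_step`, the toy quadratic loss, and all hyperparameter values here are illustrative assumptions.

```python
import numpy as np

def mezo_step(params, loss_fn, lr=1e-2, eps=1e-3, seed=0):
    """One ZO-SGD step in the MeZO style.

    The perturbation z is never materialized as a stored copy:
    it is regenerated from `seed` each time it is needed, so the
    step needs only the memory of the parameters themselves.
    """
    def perturb(scale):
        # Same seed -> same z on every call; arrays are modified in place.
        rng = np.random.default_rng(seed)
        for p in params:
            p += scale * eps * rng.standard_normal(p.shape)

    perturb(+1.0)                       # theta + eps * z
    loss_plus = loss_fn(params)
    perturb(-2.0)                       # theta - eps * z
    loss_minus = loss_fn(params)
    perturb(+1.0)                       # restore theta
    grad_proj = (loss_plus - loss_minus) / (2 * eps)  # z . grad estimate

    # Regenerate z once more for the in-place SGD update theta -= lr * grad_proj * z.
    rng = np.random.default_rng(seed)
    for p in params:
        p -= lr * grad_proj * rng.standard_normal(p.shape)
    return loss_plus

# Toy usage: minimize the quadratic ||theta||^2 with forward passes only.
theta = [np.array([1.0, -2.0, 3.0])]
for step in range(500):
    mezo_step(theta, lambda ps: float(np.sum(ps[0] ** 2)), seed=step)
```

Using a fresh seed per step gives an unbiased SPSA-style gradient estimate, while reusing the same seed within a step guarantees that the two perturbations and the update all use the identical direction z.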