ChatPaper.ai

Fine-Tuning Language Models with Just Forward Passes

May 27, 2023
Authors: Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, Sanjeev Arora
cs.AI

Abstract

Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
PDF · December 15, 2024