ChatPaper.ai

Fine-Tuning Language Models with Just Forward Passes

May 27, 2023
Authors: Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, Sanjeev Arora
cs.AI

Abstract

Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
PDF · December 15, 2024