Fine-Tuning Language Models with Just Forward Passes

Abstract

Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion-parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
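
To make the idea of a two-forward-pass, in-place ZO-SGD update concrete, the following is a minimal sketch in pure Python. It is not the authors' implementation. It assumes a toy parameter vector stored as a list of floats and a hypothetical quadratic loss standing in for a model's forward pass. The function names (`perturb_in_place`, `zo_sgd_step`, `quadratic_loss`, `run_demo`) and all hyperparameter values are illustrative assumptions. The point it tries to show is that the random direction z never needs to be stored: it can be regenerated from a seed each time it is needed, so the only memory beyond the parameters themselves is a few scalars.

```python
import random


def perturb_in_place(params, seed, scale):
    """Add scale * z to params in place, where z ~ N(0, I) is regenerated from seed."""
    rng = random.Random(seed)  # same seed -> same z, so z is never stored
    for i in range(len(params)):
        params[i] += scale * rng.gauss(0.0, 1.0)


def zo_sgd_step(params, loss_fn, lr=0.01, eps=1e-3, seed=0):
    """One zeroth-order SGD step using two loss evaluations (forward passes)."""
    perturb_in_place(params, seed, +eps)        # params is now theta + eps * z
    loss_plus = loss_fn(params)
    perturb_in_place(params, seed, -2.0 * eps)  # params is now theta - eps * z
    loss_minus = loss_fn(params)
    perturb_in_place(params, seed, +eps)        # back to theta (up to rounding)

    # Scalar estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)

    # SGD update theta <- theta - lr * projected_grad * z, again regenerating z.
    perturb_in_place(params, seed, -lr * projected_grad)
    return projected_grad


def quadratic_loss(params):
    """Hypothetical stand-in for a forward pass: minimized when every entry is 1."""
    return sum((p - 1.0) ** 2 for p in params)


def run_demo(steps=2000, dim=5):
    """Run ZO-SGD on the toy loss and return (initial_loss, final_loss)."""
    params = [0.0] * dim
    initial = quadratic_loss(params)
    for step in range(steps):
        zo_sgd_step(params, quadratic_loss, seed=step)  # fresh direction per step
    return initial, quadratic_loss(params)


if __name__ == "__main__":
    initial, final = run_demo()
    print(f"loss before: {initial:.4f}, loss after: {final:.3e}")
```

In a real LM setting, `loss_fn` would be a forward pass over a minibatch and the per-scalar loop would presumably be replaced by seeded per-tensor sampling on the accelerator. Those details are assumptions here and are not specified in the abstract.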
