Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models

Abstract

Large Language Models (LLMs) have significantly advanced natural language processing through exceptional task generalization. Low-Rank Adaptation (LoRA) offers a cost-effective fine-tuning solution: it freezes the original model parameters and trains only lightweight, low-rank adapter matrices. However, the memory footprint of LoRA is still largely dominated by the frozen original parameters. To mitigate this, we propose LoRAM, a memory-efficient LoRA training scheme founded on the intuition that many neurons in over-parameterized LLMs have low training utility but are essential for inference. LoRAM takes a distinctive approach: it trains on a pruned (small) model to obtain pruned low-rank matrices, which are then recovered and used with the original (large) model for inference. In addition, minimal-cost continual pre-training, performed in advance by the model publisher, reduces the knowledge discrepancy between the pruned and original models. Our extensive experiments demonstrate the efficacy of LoRAM across various pruning strategies and downstream tasks. For a model with 70 billion parameters, LoRAM enables training on a GPU with only 20 GB of HBM, replacing an A100-80G GPU for LoRA training and 15 GPUs for full fine-tuning. Specifically, QLoRAM, which combines structured pruning with 4-bit quantization, reduces the parameter storage cost that dominates memory usage in low-rank matrix training by 15.81× for LLaMA-3.1-70B (16.95× for LLaMA-2-70B), while achieving clear performance gains over both the original LLaMA-3.1-70B (LLaMA-2-70B) and the LoRA-trained LLaMA-3.1-8B (LLaMA-2-13B).
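The train-small, infer-large idea can be illustrated with a toy sketch. The abstract does not specify how the pruned low-rank matrices are recovered, so the code below assumes one plausible reading: structured pruning keeps a subset of output rows and input columns of a weight matrix, the low-rank factors are trained at the pruned shape, and recovery zero-fills the pruned positions so the update matches the original weight shape. The helper names (`recover_rows`, `recover_cols`, `matmul`) and all values are hypothetical and chosen only for illustration.

```python
def matmul(x, y):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def recover_rows(pruned, kept, full_rows):
    """Place the rows of `pruned` at indices `kept` inside a zero matrix
    with `full_rows` rows (assumed zero-fill recovery, not from the source)."""
    width = len(pruned[0])
    full = [[0.0] * width for _ in range(full_rows)]
    for src, dst in enumerate(kept):
        full[dst] = list(pruned[src])
    return full


def recover_cols(pruned, kept, full_cols):
    """Place the columns of `pruned` at indices `kept` inside a zero matrix
    with `full_cols` columns (assumed zero-fill recovery)."""
    full = [[0.0] * full_cols for _ in pruned]
    for r, row in enumerate(pruned):
        for src, dst in enumerate(kept):
            full[r][dst] = row[src]
    return full


# Toy setting: the original weight W is 4 x 3. Hypothetical structured pruning
# keeps output rows 0, 2, 3 and input columns 0, 2, so training sees a 3 x 2 weight.
kept_out, kept_in = [0, 2, 3], [0, 2]

# Rank-1 low-rank factors, as they might look after training on the pruned model.
B_pruned = [[1.0], [2.0], [3.0]]  # shape (3, 1)
A_pruned = [[0.5, -1.0]]          # shape (1, 2)

# Recover the factors to the original dimensions for inference.
B_full = recover_rows(B_pruned, kept_out, 4)  # shape (4, 1)
A_full = recover_cols(A_pruned, kept_in, 3)   # shape (1, 3)

# The recovered update has the original shape and is zero at pruned positions.
delta_full = matmul(B_full, A_full)           # shape (4, 3)

# Merge into the original (large) weight for inference.
W = [[1.0] * 3 for _ in range(4)]
W_merged = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta_full)]
```

In this sketch the original weights at pruned positions are left untouched by the update, which matches the intuition that those neurons matter for inference even though they are skipped during training. Only the pruned-shape factors need to be held in memory while training.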
