Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models

February 19, 2025
Authors: Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Yang You, Guiming Xie, Xuejian Gong, Kunlong Zhou
cs.AI

Abstract

Large Language Models (LLMs) have significantly advanced natural language processing with exceptional task generalization capabilities. Low-Rank Adaptation (LoRA) offers a cost-effective fine-tuning solution, freezing the original model parameters and training only lightweight, low-rank adapter matrices. However, the memory footprint of LoRA is largely dominated by the original model parameters. To mitigate this, we propose LoRAM, a memory-efficient LoRA training scheme founded on the intuition that many neurons in over-parameterized LLMs have low training utility but are essential for inference. LoRAM presents a unique twist: it trains on a pruned (small) model to obtain pruned low-rank matrices, which are then recovered and utilized with the original (large) model for inference. Additionally, minimal-cost continual pre-training, performed by the model publishers in advance, bridges the knowledge discrepancy between the pruned and original models. Our extensive experiments demonstrate the efficacy of LoRAM across various pruning strategies and downstream tasks. For a model with 70 billion parameters, LoRAM enables training on a GPU with only 20 GB of HBM, replacing the A100-80G GPU required for LoRA training and the 15 GPUs required for full fine-tuning. Specifically, QLoRAM, which combines structured pruning with 4-bit quantization, reduces the parameter storage cost that dominates memory usage during low-rank matrix training by 15.81× for LLaMA-3.1-70B (16.95× for LLaMA-2-70B), while achieving performance gains over both the original LLaMA-3.1-70B (LLaMA-2-70B) and the LoRA-trained LLaMA-3.1-8B (LLaMA-2-13B).
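To make the "train small, infer large" recovery step concrete, below is a minimal PyTorch sketch of how low-rank factors trained against a structurally pruned linear layer could be zero-padded back to the original layer's dimensions and applied at inference. The function names, the index-based recovery scheme, and the toy sizes are illustrative assumptions based on the abstract, not the paper's exact implementation.

```python
import torch

def recover_lora_factors(A_pruned, B_pruned, kept_in_idx, kept_out_idx,
                         d_in_full, d_out_full):
    """Zero-pad LoRA factors trained on a pruned layer back to full size.

    Hypothetical recovery step: dimensions removed by structured pruning
    receive zero updates, so the frozen original weights are untouched there.

    A_pruned: (r, d_in_pruned)   -- LoRA "A" factor from the pruned model
    B_pruned: (d_out_pruned, r)  -- LoRA "B" factor from the pruned model
    kept_in_idx / kept_out_idx: indices of the input/output neurons that
        survived pruning in the small training model
    """
    r = A_pruned.shape[0]
    A_full = torch.zeros(r, d_in_full, dtype=A_pruned.dtype)
    B_full = torch.zeros(d_out_full, r, dtype=B_pruned.dtype)
    A_full[:, kept_in_idx] = A_pruned   # restore surviving input dims
    B_full[kept_out_idx, :] = B_pruned  # restore surviving output dims
    return A_full, B_full

def lora_inference(x, W_full, A_full, B_full, scaling=1.0):
    """Original (large, frozen) weight plus the recovered low-rank update:
    y = x W^T + scaling * (x A^T) B^T."""
    return x @ W_full.T + scaling * (x @ A_full.T) @ B_full.T

# Toy usage with made-up sizes: LoRA factors trained on a layer pruned
# from 1024 to 640 dimensions, recovered and applied at full width.
d_in_full, d_out_full, r = 1024, 1024, 8
kept_in = torch.arange(640)
kept_out = torch.arange(640)
A_p = torch.randn(r, kept_in.numel()) * 0.01
B_p = torch.zeros(kept_out.numel(), r)        # LoRA's B is zero-initialized
A_f, B_f = recover_lora_factors(A_p, B_p, kept_in, kept_out,
                                d_in_full, d_out_full)
W = torch.randn(d_out_full, d_in_full) * 0.01  # stands in for frozen weights
y = lora_inference(torch.randn(2, d_in_full), W, A_f, B_f)
```

Under this reading, training memory scales with the pruned model because only the small factors and pruned weights are held during optimization, while the full-size frozen weights are needed only at inference, which is where the reported savings come from.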
