LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models

Abstract

Quantization is an indispensable technique for serving Large Language Models (LLMs) and has recently found its way into LoRA fine-tuning. In this work, we focus on the scenario where quantization and LoRA fine-tuning are applied together to a pre-trained model. In such cases, it is common to observe a consistent gap in downstream-task performance between full fine-tuning and the quantization plus LoRA fine-tuning approach. In response, we propose LoftQ (LoRA-Fine-Tuning-aware Quantization), a novel quantization framework that simultaneously quantizes an LLM and finds a proper low-rank initialization for LoRA fine-tuning. Such an initialization alleviates the discrepancy between the quantized and full-precision models and significantly improves generalization on downstream tasks. We evaluate our method on natural language understanding, question answering, summarization, and natural language generation tasks. Experiments show that our method is highly effective and outperforms existing quantization methods, especially in the challenging 2-bit and 2/4-bit mixed-precision regimes. We will release our code.
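The abstract does not spell out how the quantized weights and the low-rank initialization are found together, so the following is only a minimal sketch of one plausible reading: alternate between quantizing the part of the weight matrix not yet explained by the low-rank factors and refitting those factors to the quantization residual with a truncated SVD. The function names (`uniform_quantize`, `low_rank_init`), the simulated uniform quantizer, and the default bit width, rank, and step count are all assumptions made for illustration and are not taken from the text.

```python
import numpy as np


def uniform_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Simulated uniform quantization (assumed stand-in for a real quantizer).

    Values are rounded to 2**bits evenly spaced levels between min and max,
    then mapped back to floats so the error can be measured directly.
    """
    levels = 2 ** bits - 1
    lo, hi = float(w.min()), float(w.max())
    if hi == lo:
        return w.copy()
    scale = (hi - lo) / levels
    return np.round((w - lo) / scale) * scale + lo


def low_rank_init(w: np.ndarray, bits: int = 2, rank: int = 4, steps: int = 5):
    """Hypothetical joint initialization: returns (q, a, b) with w ~ q + a @ b.T."""
    a = np.zeros((w.shape[0], rank))
    b = np.zeros((w.shape[1], rank))
    for _ in range(steps):
        # Quantize whatever the current low-rank factors do not explain.
        q = uniform_quantize(w - a @ b.T, bits)
        # Refit the factors to the quantization residual (best rank-`rank` fit).
        u, s, vt = np.linalg.svd(w - q, full_matrices=False)
        a = u[:, :rank] * np.sqrt(s[:rank])
        b = vt[:rank].T * np.sqrt(s[:rank])
    return q, a, b


# Toy comparison on a random matrix standing in for a pre-trained weight.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))

q_plain = uniform_quantize(w, bits=2)
q, a, b = low_rank_init(w, bits=2, rank=8, steps=5)

print("quantization only :", np.linalg.norm(w - q_plain))
print("quantized+low-rank:", np.linalg.norm(w - (q + a @ b.T)))
```

In this sketch the frozen backbone would be `q`, and `a` and `b` would serve as the starting point for the LoRA adapters in place of a zero initialization. After a single step the combined approximation can be no worse than plain quantization, because the truncated SVD is the best low-rank fit to the residual; whether further alternation helps depends on the quantizer and the weights.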
