QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models

Abstract

The demand for efficient deployment of large language models (LLMs) has driven interest in quantization, which reduces inference cost, and in parameter-efficient fine-tuning (PEFT), which lowers training overhead. This has motivated the development of quantization-aware PEFT, which aims to produce quantized models that are both accurate and efficient. In this setting, reducing quantization error before fine-tuning is crucial for achieving high model accuracy. However, existing methods that rely on low-rank adaptation suffer from limited representational capacity. Recent Fourier-related transform (FT)-based adapters offer greater representational power than low-rank adapters, but integrating them directly into quantized models often results in ineffective error reduction and increased computational overhead. To overcome these limitations, we propose QWHA, a method that integrates FT-based adapters into quantized models by employing the Walsh-Hadamard Transform (WHT) as the transform kernel, together with a novel adapter initialization scheme that incorporates adaptive parameter selection and value refinement. We demonstrate that QWHA effectively mitigates quantization errors while facilitating fine-tuning, and that its design substantially reduces computational cost. Experimental results show that QWHA consistently outperforms baselines in low-bit quantization accuracy and achieves significant training speedups over existing FT-based adapters. The code is available at https://github.com/vantaa89/qwha.
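For readers unfamiliar with the transform kernel named above, the following is a minimal sketch of a fast Walsh-Hadamard transform and of how a few spectral coefficients expand into a dense update. It is not the authors' implementation: the function names `fwht` and `sparse_wht_update` are hypothetical, the example works on a 1-D vector rather than a weight matrix, and it does not model the initialization scheme (adaptive parameter selection and value refinement) described in the abstract. It only illustrates two general properties of the WHT: the kernel uses additions and subtractions alone, and the transform is its own inverse up to a factor of n.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform in natural (Hadamard) order.

    Runs in O(n log n) using only additions and subtractions.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h positions apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def sparse_wht_update(n, coefficients):
    """Expand a few (index, value) spectral coefficients into a dense update.

    Hypothetical helper for illustration: the sparse coefficients stand in
    for the small set of trainable adapter parameters.
    """
    spectrum = [0.0] * n
    for index, value in coefficients:
        spectrum[index] = value
    # The Hadamard matrix H is symmetric and H @ H = n * I,
    # so the inverse transform is the forward transform divided by n.
    return [x / n for x in fwht(spectrum)]


if __name__ == "__main__":
    print(fwht([1, 0, 1, 0]))
    print(sparse_wht_update(4, [(1, 2.0)]))
```

In this sketch a single nonzero coefficient already yields an update that touches every position, which is one way to see how a transform-based adapter with few trainable values can still produce a dense weight change.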
