Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs

Abstract

Large Language Models (LLMs) have demonstrated exceptional capabilities on language-related tasks. However, their deployment poses significant challenges because of their considerable memory and storage requirements. In response, weight-only quantization, particularly 3- and 4-bit weight-only quantization, has emerged as one of the most viable solutions. As the number of bits decreases, the quantization grid broadens, which increases the importance of choosing between rounding up and rounding down. Previous studies have shown that fine-tuning the up and down rounding by adding perturbations can improve accuracy in some scenarios. Our study is motivated by the precise and limited boundary of these perturbations, where only the threshold for altering the rounding value matters. Consequently, we propose a concise and highly effective approach for optimizing the weight rounding task. Our method, named SignRound, uses lightweight block-wise tuning with signed gradient descent and achieves outstanding results within 400 steps. SignRound outperforms the established rounding-to-nearest (RTN) baseline and competes impressively against recent methods, without introducing additional inference overhead. The source code will be publicly available at https://github.com/intel/neural-compressor soon.
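As a rough illustration of the idea summarized in the abstract, the following pure-Python sketch tunes a per-weight rounding perturbation with signed gradient descent and compares the result against RTN. It is not the authors' implementation. Several details are assumptions not stated in the text above: the perturbation is clamped to [-0.5, 0.5], gradients pass through the rounding step with a straight-through estimator, the learning rate is constant, the grid is a symmetric 4-bit grid, and a single weight row of a linear layer stands in for a transformer block. The names fake_quant and sign_round_sketch are hypothetical.

```python
import math
import random


def fake_quant(w, v, scale, qmin=-8, qmax=7):
    """Quantize-dequantize each weight as scale * clip(round(w / scale + v))."""
    out = []
    for wj, vj in zip(w, v):
        q = math.floor(wj / scale + vj + 0.5)  # round half up
        out.append(scale * max(qmin, min(qmax, q)))
    return out


def sign_round_sketch(w, xs, steps=400, lr=0.0025):
    """Tune rounding perturbations v with signed gradient descent.

    w  : one row of layer weights (list of floats)
    xs : calibration inputs (list of lists, same width as w)
    Returns (best_v, rtn_loss, best_loss).
    """
    scale = max(abs(wj) for wj in w) / 7  # assumed symmetric 4-bit scale
    v = [0.0] * len(w)                    # v = 0 reproduces plain RTN
    best_v, best_loss, rtn_loss = list(v), None, None

    for _ in range(steps):
        wq = fake_quant(w, v, scale)
        # Output error of the quantized row on every calibration sample.
        err = [sum(xi * (a - b) for xi, a, b in zip(x, wq, w)) for x in xs]
        loss = sum(e * e for e in err) / len(xs)
        if rtn_loss is None:
            rtn_loss = loss               # first pass uses v = 0, i.e. RTN
        if best_loss is None or loss < best_loss:
            best_v, best_loss = list(v), loss

        for j in range(len(w)):
            # Straight-through estimator: treat d(wq_j)/d(v_j) as scale > 0,
            # so the sign of the gradient is the sign of sum_i err_i * x_ij.
            g = sum(e * x[j] for e, x in zip(err, xs))
            if g > 0:
                v[j] -= lr
            elif g < 0:
                v[j] += lr
            v[j] = max(-0.5, min(0.5, v[j]))  # assumed perturbation bound

    return best_v, rtn_loss, best_loss


if __name__ == "__main__":
    random.seed(0)
    w = [random.uniform(-1, 1) for _ in range(16)]
    xs = [[random.gauss(0, 1) for _ in range(16)] for _ in range(32)]
    best_v, rtn_loss, best_loss = sign_round_sketch(w, xs)
    print("RTN loss:", rtn_loss, "tuned loss:", best_loss)
```

Because only the sign of the gradient is used, every perturbation moves by a fixed step per iteration, so a rounding decision flips only once the accumulated perturbation crosses the rounding threshold. The sketch keeps the best perturbation seen so far, which means the returned loss can never be worse than the RTN starting point.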
