Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization

Abstract

Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit precision it often fails catastrophically, even with extensive search and fine-tuning. We show that the dominant bottleneck is codebook initialisation. Greedy sequential initialisation frequently places the model in poor optimisation regions that subsequent beam search and PV-tuning struggle to escape. We analyse this behaviour through the representational ratio ρ = N/KM, which characterises the relationship between weight groups and codebook capacity, and propose OA-EM, an output-aware EM initialisation method that uses a Hessian-weighted Mahalanobis distance. Across compression rates, search budgets, and three architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B), OA-EM consistently produces better solutions after PV-tuning and dominates the quality-compute frontier. The severity of the bottleneck scales with ρ: it is moderate at 3 bpp but extreme at 2 bpp, where poor initialisation can degrade perplexity by orders of magnitude. More broadly, our results highlight the importance of optimisation geometry in compressed model spaces, where initialisation can dominate subsequent search and fine-tuning.
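As a rough illustration of the quantities named in the abstract, the following pure-Python sketch shows additive dequantization as a sum of table lookups, a representational ratio, and a Hessian-weighted squared distance. The abstract does not define its symbols or give formulas, so everything here is an assumption: N is read as the number of weight groups, K as the number of entries per codebook, M as the number of codebooks, N/KM as N/(K·M), and the distance as the quadratic form (w − c)ᵀ H (w − c). The function names are hypothetical and this is not the authors' implementation of OA-EM.

```python
from typing import List, Sequence

Vector = List[float]


def dequantize_group(codes: Sequence[int],
                     codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Reconstruct one weight group as the sum of one codeword per codebook.

    Assumption: codes[m] indexes into codebooks[m]; the cost is M table
    lookups per group, independent of the codebook size K.
    """
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for m, code in enumerate(codes):
        codeword = codebooks[m][code]
        for i in range(dim):
            out[i] += codeword[i]
    return out


def representational_ratio(num_groups: int,
                           codebook_size: int,
                           num_codebooks: int) -> float:
    """rho = N / (K * M) under the assumed reading of N, K and M."""
    return num_groups / (codebook_size * num_codebooks)


def hessian_weighted_sq_distance(w: Vector,
                                 c: Vector,
                                 hessian: Sequence[Vector]) -> float:
    """Squared distance (w - c)^T H (w - c).

    Assumption: H is a symmetric positive semi-definite proxy Hessian for
    the group, used as the weighting matrix of the quadratic form.
    """
    d = [wi - ci for wi, ci in zip(w, c)]
    n = len(d)
    return sum(d[i] * hessian[i][j] * d[j] for i in range(n) for j in range(n))
```

Under these assumptions, a larger ρ means more weight groups compete for each codebook entry, which is the regime where the abstract reports initialisation mattering most.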
