

Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization

April 9, 2026
Authors: Ian W. Kennedy, Nafise Sadat Moosavi
cs.AI

Abstract

Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit precision, it often fails catastrophically, even with extensive search and fine-tuning. We show that the dominant bottleneck is codebook initialisation. Greedy sequential initialisation frequently places the model in poor optimisation regions that subsequent beam search and PV-tuning struggle to overcome. We analyse this behaviour through the representational ratio ρ = N/KM, which characterises the relationship between weight groups and codebook capacity, and propose OA-EM, an output-aware EM initialisation method using Hessian-weighted Mahalanobis distance. Across compression rates, search budgets, and three architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B), OA-EM consistently produces better solutions after PV-tuning and dominates the quality-compute frontier. The severity of the bottleneck scales with ρ: moderate at 3 bpp but extreme at 2 bpp, where poor initialisation can degrade perplexity by orders of magnitude. More broadly, our results highlight the importance of optimisation geometry in compressed model spaces, where initialisation can dominate subsequent search and fine-tuning.
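To make the two key quantities concrete, here is a minimal toy sketch (not the paper's implementation): the representational ratio ρ is read literally from the abstract as N/(K·M), and the output-aware assignment step picks, for each weight group w, the codeword c minimising the Hessian-weighted Mahalanobis distance (w − c)ᵀH(w − c). All sizes, the diagonal Hessian proxy, and the function names are hypothetical choices for this illustration.

```python
import numpy as np

def representational_ratio(n_groups: int, k: int, m: int) -> float:
    """rho = N / (K * M): weight groups per unit of codebook capacity,
    taking the abstract's expression N/KM at face value (the paper may
    define it differently, e.g. via total additive capacity)."""
    return n_groups / (k * m)

def hessian_weighted_assign(weights: np.ndarray,
                            codebook: np.ndarray,
                            hessian: np.ndarray) -> np.ndarray:
    """For each row w of `weights`, return the index of the codeword c
    minimising the quadratic form (w - c)^T H (w - c)."""
    # diff[i, j] = weights[i] - codebook[j], shape (N, K, d)
    diff = weights[:, None, :] - codebook[None, :, :]
    # Hessian-weighted squared distance per (group, codeword) pair.
    dists = np.einsum('nkd,de,nke->nk', diff, hessian, diff)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 4))            # 16 weight groups of dimension 4
C = rng.normal(size=(8, 4))             # one codebook with K = 8 entries
H = np.diag([4.0, 1.0, 1.0, 0.25])      # toy diagonal Hessian proxy:
                                        # errors on dim 0 cost most

idx = hessian_weighted_assign(W, C, H)
print(representational_ratio(16, 8, 1))  # 2.0 for this toy setup
print(idx.shape)                         # (16,)
```

With H set to the identity this reduces to plain Euclidean nearest-codeword assignment; an informative H skews assignments toward codewords that match w on the output-sensitive directions, which is the intuition behind "output-aware" initialisation.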