

Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization

April 9, 2026
作者: Ian W. Kennedy, Nafise Sadat Moosavi
cs.AI

Abstract

Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit precision, it often fails catastrophically, even with extensive search and fine-tuning. We show that the dominant bottleneck is codebook initialisation. Greedy sequential initialisation frequently places the model in poor optimisation regions that subsequent beam search and PV-tuning struggle to overcome. We analyse this behaviour through the representational ratio ρ = N/KM, which characterises the relationship between weight groups and codebook capacity, and propose OA-EM, an output-aware EM initialisation method using Hessian-weighted Mahalanobis distance. Across compression rates, search budgets, and three architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B), OA-EM consistently produces better solutions after PV-tuning and dominates the quality-compute frontier. The severity of the bottleneck scales with ρ: moderate at 3 bpp but extreme at 2 bpp, where poor initialisation can degrade perplexity by orders of magnitude. More broadly, our results highlight the importance of optimisation geometry in compressed model spaces, where initialisation can dominate subsequent search and fine-tuning.
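As a rough illustration of the output-aware assignment idea described in the abstract, the sketch below runs a simple EM loop over weight groups whose E-step uses a Hessian-weighted Mahalanobis distance d(w, c) = (w − c)ᵀ H (w − c), with H a proxy Hessian (e.g. XXᵀ accumulated from calibration activations). The single shared codebook, the function names, and the plain-mean M-step are illustrative assumptions for exposition, not the paper's actual method or API.

```python
import numpy as np

def hessian_weighted_assign(W, C, H):
    """E-step: assign each weight group (row of W) to the codeword that
    minimises the Hessian-weighted Mahalanobis distance
    d(w, c) = (w - c)^T H (w - c).

    W: (N, d) weight groups; C: (K, d) codebook; H: (d, d) proxy Hessian
    (illustrative setup, not the paper's API).
    """
    diffs = W[:, None, :] - C[None, :, :]            # (N, K, d)
    # Quadratic form d^T H d for every (group, codeword) pair.
    dists = np.einsum('nkd,de,nke->nk', diffs, H, diffs)
    return dists.argmin(axis=1)                      # (N,) codeword indices

def oa_em_init(W, K, H, iters=10):
    """Toy single-codebook EM initialisation sketch.

    Note: because H is shared across all groups, the minimiser of
    sum_i (w_i - c)^T H (w_i - c) over c is the plain mean, so the
    M-step below is exact, not an approximation.
    """
    # Deterministic spread-out seeding (illustrative choice).
    idx = np.linspace(0, len(W) - 1, K).astype(int)
    C = W[idx].copy()
    a = hessian_weighted_assign(W, C, H)
    for _ in range(iters):
        a = hessian_weighted_assign(W, C, H)         # E-step
        for k in range(K):                           # M-step
            mask = a == k
            if mask.any():
                C[k] = W[mask].mean(axis=0)          # keep codeword if empty
    return C, a
```

With H set to the identity, the distance reduces to Euclidean and the loop behaves like plain k-means; a calibration-derived H instead pulls codewords toward directions that matter most for the layer's output error, which is the "output-aware" intuition.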