

BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models

February 4, 2026
作者: Junyu Chen, Jungang Li, Jing Xiong, Wenjie Wang, Qingyao Yang, He Xiao, Zhen Li, Taiqiang Wu, Mengzhao Chen, Zhen Peng, Chaofan Tao, Long Shi, Hongxia Yang, Ngai Wong
cs.AI

Abstract

Large language model (LLM) inference is often bounded by memory footprint and memory bandwidth in resource-constrained deployments, making quantization a fundamental technique for efficient serving. While post-training quantization (PTQ) maintains high fidelity at 4-bit, it deteriorates at 2-3 bits. Fundamentally, existing methods enforce a shape-invariant quantization grid (e.g., the fixed uniform intervals of UINT2) for each group, severely restricting the feasible set for error minimization. To address this, we propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients, and iteratively refines them using approximate second-order information while progressively compensating quantization errors to minimize output discrepancy. In the 2-bit regime, BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit). Moreover, we provide theoretical analysis showing that the variable grid expands the feasible set, and that the quantization process consistently aligns with the optimization objective in Hessian-induced geometry. Code: github.com/KingdalfGoodman/BPDQ.
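The core idea of a variable grid, decomposing each weight group into bit-planes that each carry their own scalar coefficient, can be illustrated with a toy greedy residual binarization. This is a minimal sketch of the general bit-plane-plus-coefficient representation, not the paper's actual BPDQ procedure (which additionally refines the grid with approximate second-order information and error compensation); all function names here are illustrative.

```python
import numpy as np

def bitplane_quantize(w, num_planes=2):
    """Approximate a weight vector as a sum of scaled sign bit-planes:
    w ≈ sum_k alpha_k * b_k, with b_k in {-1, +1}^n.
    Each plane is fit greedily to the current residual; the grid points
    {±a_1 ± a_2 ± ...} vary with the learned coefficients instead of
    being fixed uniform intervals as in plain UINT2 quantization."""
    residual = np.asarray(w, dtype=np.float64).copy()
    coeffs, planes = [], []
    for _ in range(num_planes):
        b = np.where(residual >= 0, 1.0, -1.0)
        # Least-squares optimal scalar for a sign plane:
        # alpha* = <residual, b> / <b, b> = mean(|residual|)
        alpha = np.abs(residual).mean()
        coeffs.append(alpha)
        planes.append(b)
        residual -= alpha * b
    return coeffs, planes

def dequantize(coeffs, planes):
    """Reconstruct the weights from bit-planes and their coefficients."""
    return sum(a * b for a, b in zip(coeffs, planes))
```

Because each added plane subtracts the least-squares projection of the residual onto a sign vector, the reconstruction error strictly decreases with every additional bit-plane, which is one way the variable grid expands the feasible set relative to a fixed uniform grid at the same bit width.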