BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models

Abstract

Large language model (LLM) inference is often bounded by memory footprint and memory bandwidth in resource-constrained deployments, making quantization a fundamental technique for efficient serving. While post-training quantization (PTQ) maintains high fidelity at 4-bit, it deteriorates at 2-3 bits. Fundamentally, existing methods enforce a shape-invariant quantization grid (e.g., the fixed uniform intervals of UINT2) for each group, severely restricting the feasible set for error minimization. To address this, we propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients, and iteratively refines them using approximate second-order information while progressively compensating quantization errors to minimize output discrepancy. In the 2-bit regime, BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit). Moreover, we provide theoretical analysis showing that the variable grid expands the feasible set, and that the quantization process consistently aligns with the optimization objective in Hessian-induced geometry. Code: github.com/KingdalfGoodman/BPDQ.
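The contrast between a shape-invariant grid and a variable grid can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes each weight in a group is approximated as an offset plus a sum of binary bit-planes scaled by scalar coefficients, and it fits those coefficients by plain least squares on the weights themselves rather than with the approximate second-order (Hessian) information and error compensation the abstract describes. The function names `uniform_quantize`, `grid_points`, and `fit_variable_grid`, and the offset term, are hypothetical choices made for illustration.

```python
import itertools
import numpy as np


def uniform_quantize(w, num_planes=2):
    """Baseline: round each weight to a fixed, evenly spaced grid (UINT2 when num_planes=2)."""
    w = np.asarray(w, dtype=float)
    levels = 2 ** num_planes - 1
    step = (w.max() - w.min()) / levels
    if step == 0:
        return w.copy()
    q = np.clip(np.round((w - w.min()) / step), 0, levels)
    return w.min() + q * step


def grid_points(coeffs):
    """Enumerate every bit pattern and the level it maps to.

    coeffs[0] is an offset (an assumption of this sketch); coeffs[1:] holds
    one scalar coefficient per bit-plane.
    """
    num_planes = len(coeffs) - 1
    bits = np.array(list(itertools.product([0, 1], repeat=num_planes)), dtype=float)
    return bits, coeffs[0] + bits @ coeffs[1:]


def fit_variable_grid(w, num_planes=2, iters=20):
    """Alternate nearest-level assignment with a least-squares refit of the coefficients."""
    w = np.asarray(w, dtype=float)
    # Initialise on the uniform grid: coefficients step * 2^i reproduce UINT-style levels.
    step = (w.max() - w.min()) / (2 ** num_planes - 1)
    coeffs = np.concatenate(([w.min()], step * 2.0 ** np.arange(num_planes)))
    for _ in range(iters):
        # Step 1: with the grid fixed, pick the nearest level (bit pattern) per weight.
        bits, grid = grid_points(coeffs)
        idx = np.abs(w[:, None] - grid[None, :]).argmin(axis=1)
        # Step 2: with the bit-planes fixed, refit offset and coefficients.
        design = np.column_stack([np.ones(len(w)), bits[idx]])
        coeffs, *_ = np.linalg.lstsq(design, w, rcond=None)
    # Final assignment against the refined grid.
    bits, grid = grid_points(coeffs)
    idx = np.abs(w[:, None] - grid[None, :]).argmin(axis=1)
    return coeffs, bits[idx].astype(np.uint8), grid[idx]


rng = np.random.default_rng(0)
group = rng.standard_normal(128)  # one synthetic weight group
coeffs, planes, approx = fit_variable_grid(group)
err_variable = np.sum((group - approx) ** 2)
err_uniform = np.sum((group - uniform_quantize(group)) ** 2)
print(f"uniform grid error:  {err_uniform:.4f}")
print(f"variable grid error: {err_variable:.4f}")
```

Because the fit starts from the uniform grid and neither alternating step can increase the squared error, the variable grid in this sketch is never worse than the uniform baseline on the same group. That mirrors, in a simplified setting, the abstract's claim that a variable grid enlarges the feasible set.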
