WUSH: Near-Optimal Adaptive Transforms for LLM Quantization
November 30, 2025
Authors: Jiale Chen, Vage Egiazarian, Torsten Hoefler, Dan Alistarh
cs.AI
Abstract
Quantization to low bitwidth is a standard approach for deploying large language models; however, a few extreme weights and activations stretch the dynamic range and reduce the effective resolution of the quantizer. A common mitigation is to apply a fixed orthogonal transform, such as a Hadamard matrix, before quantization, which typically compresses the dynamic range. Yet these transforms ignore the statistics of the data, and their optimality is not well understood. In this work, we derive, for the first time, closed-form optimal linear blockwise transforms for joint weight-activation quantization under standard data-free quantizers and common numerical formats. Specifically, we derive the optimal adaptive (data-aware) transforms for round-to-nearest (RTN), AbsMax-scaled block quantizers in both integer and floating-point formats. The resulting construction, which we call WUSH, combines a Hadamard backbone with a data-dependent component based on second-order moments, yielding a non-orthogonal transform that is provably optimal under mild assumptions yet remains structured for efficient implementation. Preliminary experimental results show that our approach consistently improves upon the Hadamard transform across common formats.
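The abstract does not spell out WUSH's exact construction, so as background the following is a minimal NumPy sketch of the baseline setting it refers to: an AbsMax-scaled round-to-nearest (RTN) block quantizer, with and without a fixed blockwise Hadamard rotation applied before quantization. The function names (`hadamard`, `rtn_absmax_quantize`), the 4-bit width, and the block size of 64 are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an orthonormal n x n Hadamard matrix (n a power of 2)."""
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rtn_absmax_quantize(x: np.ndarray, bits: int = 4, block: int = 64) -> np.ndarray:
    """Blockwise AbsMax-scaled round-to-nearest (RTN) integer quantization.

    Each block of `block` consecutive values is scaled so that its largest
    absolute value reaches the edge of the signed integer grid, then rounded.
    Returns the dequantized reconstruction.
    """
    qmax = 2 ** (bits - 1) - 1
    xb = x.reshape(-1, block)
    scale = np.abs(xb).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0  # avoid division by zero on all-zero blocks
    q = np.clip(np.round(xb / scale), -qmax - 1, qmax)
    return (q * scale).reshape(-1)

# Toy comparison: quantize a heavy-tailed weight vector directly vs. after
# rotating each block with a Hadamard matrix (the fixed-transform baseline).
rng = np.random.default_rng(0)
w = rng.standard_t(df=3, size=4096)          # heavy tails mimic outlier weights
H = hadamard(64)
w_rot = (w.reshape(-1, 64) @ H.T).reshape(-1)

err_plain = np.mean((w - rtn_absmax_quantize(w)) ** 2)
err_had = np.mean((w_rot - rtn_absmax_quantize(w_rot)) ** 2)
print(f"MSE without transform: {err_plain:.6f}")
print(f"MSE with Hadamard:     {err_had:.6f}")
```

Because the Hadamard matrix is orthonormal, the error measured in the rotated domain equals the error after rotating back, so the two MSE figures are directly comparable; the rotation typically spreads outliers across the block and lowers the error. Per the abstract, WUSH replaces this fixed rotation with a data-aware, non-orthogonal transform built from a Hadamard backbone and second-order moment statistics.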