6Bit-Diffusion: Inference-Time Mixed-Precision Quantization for Video Diffusion Models
March 19, 2026
Authors: Rundong Su, Jintao Zhang, Zhihang Yuan, Haojie Duanmu, Jianfei Chen, Jun Zhu
cs.AI
Abstract
Diffusion transformers have demonstrated remarkable capabilities in video generation, but their practical deployment is severely constrained by high memory usage and computational cost. Post-Training Quantization (PTQ) provides a practical way to reduce memory usage and boost computation speed. Existing quantization methods typically apply a static bit-width allocation, overlooking how the quantization difficulty of activations varies across diffusion timesteps, which leads to a suboptimal trade-off between efficiency and quality. In this paper, we propose an inference-time NVFP4/INT8 mixed-precision quantization framework. We find a strong linear correlation between a block's input-output difference and the quantization sensitivity of its internal linear layers. Based on this insight, we design a lightweight predictor that dynamically allocates NVFP4 to temporally stable layers to maximize memory compression, while selectively preserving INT8 for volatile layers to ensure robustness. This adaptive precision strategy enables aggressive quantization without compromising generation quality. In addition, we observe that the residual between the input and output of a Transformer block exhibits high temporal consistency across timesteps. Leveraging this temporal redundancy, we introduce a Temporal Delta Cache (TDC) that skips computation for these invariant blocks, further reducing computational cost. Extensive experiments demonstrate that our method achieves a 1.92× end-to-end speedup and a 3.32× memory reduction, setting a new baseline for efficient inference in video DiTs.
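The two ideas in the abstract can be illustrated with a minimal sketch. The `choose_bitwidths` helper stands in for the lightweight predictor: it uses each block's input-output difference as a proxy for quantization sensitivity and assigns NVFP4 to the most stable blocks, INT8 to the rest. The `TemporalDeltaCache` class mimics TDC: it caches a block's residual (output minus input) and reuses it once the residual has stopped changing across timesteps. All names, the `fp4_ratio` split, and the `tol` skip threshold are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


def choose_bitwidths(io_diff_norms, fp4_ratio=0.7):
    """Hypothetical predictor: rank blocks by input-output difference
    (a proxy for quantization sensitivity) and give the most temporally
    stable fraction NVFP4; the volatile remainder keeps INT8."""
    order = np.argsort(io_diff_norms)          # most stable blocks first
    n_fp4 = int(len(io_diff_norms) * fp4_ratio)
    bits = ["INT8"] * len(io_diff_norms)
    for i in order[:n_fp4]:
        bits[i] = "NVFP4"
    return bits


class TemporalDeltaCache:
    """Hypothetical TDC sketch: skip a block once its residual
    (output - input) becomes temporally consistent across timesteps."""

    def __init__(self, tol=0.05):
        self.tol = tol
        self.prev_residual = None
        self.prev_change = None  # relative change between last two residuals

    def step(self, x, block_fn):
        # If the residual barely changed last time, reuse it and skip
        # the block's computation entirely.
        if self.prev_change is not None and self.prev_change < self.tol:
            return x + self.prev_residual
        out = block_fn(x)
        residual = out - x
        if self.prev_residual is not None:
            self.prev_change = (
                np.linalg.norm(residual - self.prev_residual)
                / (np.linalg.norm(self.prev_residual) + 1e-8)
            )
        self.prev_residual = residual
        return out
```

In this sketch the skip decision costs only a cached scalar comparison per block, so timesteps where every block is "invariant" reduce to cheap residual additions.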