

6Bit-Diffusion: Inference-Time Mixed-Precision Quantization for Video Diffusion Models

March 19, 2026
Authors: Rundong Su, Jintao Zhang, Zhihang Yuan, Haojie Duanmu, Jianfei Chen, Jun Zhu
cs.AI

Abstract

Diffusion transformers have demonstrated remarkable capabilities in generating videos. However, their practical deployment is severely constrained by high memory usage and computational cost. Post-Training Quantization provides a practical way to reduce memory usage and boost computation speed. Existing quantization methods typically apply a static bit-width allocation, overlooking how the quantization difficulty of activations varies across diffusion timesteps, which leads to a suboptimal trade-off between efficiency and quality. In this paper, we propose an inference-time NVFP4/INT8 mixed-precision quantization framework. We find a strong linear correlation between a block's input-output difference and the quantization sensitivity of its internal linear layers. Based on this insight, we design a lightweight predictor that dynamically allocates NVFP4 to temporally stable layers to maximize memory compression, while selectively preserving INT8 for volatile layers to ensure robustness. This adaptive precision strategy enables aggressive quantization without compromising generation quality. In addition, we observe that the residual between the input and output of a Transformer block exhibits high temporal consistency across timesteps. Leveraging this temporal redundancy, we introduce Temporal Delta Cache (TDC) to skip computation for these near-invariant blocks, further reducing computational cost. Extensive experiments demonstrate that our method achieves a 1.92× end-to-end speedup and a 3.32× memory reduction, setting a new baseline for efficient inference in Video DiTs.
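The allocation rule described in the abstract — use a block's input-output difference as a cheap proxy for the quantization sensitivity of its internal linear layers — can be sketched as follows. This is a minimal illustration, not the paper's method: the function name, the relative-difference metric, and the threshold are all hypothetical, and no actual NVFP4/INT8 kernels are involved.

```python
def allocate_precision(io_diffs, threshold=0.1):
    """Toy bit-width allocator (threshold is a hypothetical value).

    io_diffs: per-block relative input-output differences,
    e.g. ||out - in|| / ||in||, measured at the current timestep.
    Blocks with a small difference are treated as temporally stable
    and quantized aggressively (NVFP4); volatile blocks keep INT8.
    """
    return ["NVFP4" if d < threshold else "INT8" for d in io_diffs]


# Example: three blocks, the middle one is volatile.
print(allocate_precision([0.02, 0.40, 0.08]))  # → ['NVFP4', 'INT8', 'NVFP4']
```

In the paper's framework this decision is made at inference time by a lightweight predictor rather than a fixed threshold, so the precision map can change from timestep to timestep.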
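The Temporal Delta Cache idea — replay a block's residual from the previous timestep when the block has barely changed — might look like the following sketch. The assumptions here are mine: the class name, the input-similarity reuse test, and the tolerance are illustrative, and the paper's actual TDC may use a different skip criterion.

```python
import numpy as np

class TemporalDeltaCache:
    """Toy residual cache across diffusion timesteps (illustrative only).

    For each Transformer block we remember the last input and the last
    residual (output - input). If the new input is close to the cached
    one, we skip the block entirely and replay the cached residual.
    """

    def __init__(self, tol=0.05):
        self.tol = tol     # hypothetical relative tolerance
        self.cache = {}    # block_id -> (last_input, last_residual)

    def __call__(self, block_id, block_fn, x):
        entry = self.cache.get(block_id)
        if entry is not None:
            last_x, last_res = entry
            rel = np.linalg.norm(x - last_x) / (np.linalg.norm(last_x) + 1e-8)
            if rel < self.tol:          # temporally consistent: skip compute
                return x + last_res
        out = block_fn(x)               # full block evaluation
        self.cache[block_id] = (x, out - x)
        return out
```

Skipping trades a small approximation error for the block's full compute; the reported 1.92× end-to-end speedup combines this caching with the mixed-precision kernels.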