QVGen: Pushing the Limit of Quantized Video Generative Models
May 16, 2025
Authors: Yushi Huang, Ruihao Gong, Jing Liu, Yifu Ding, Chengtao Lv, Haotong Qin, Jun Zhang
cs.AI
Abstract
Video diffusion models (DMs) have enabled high-quality video synthesis. Yet,
their substantial computational and memory demands pose serious challenges to
real-world deployment, even on high-end GPUs. As a commonly adopted solution,
quantization has shown notable success in reducing costs for image DMs, but
its direct application to video DMs remains ineffective. In this paper, we
present QVGen, a novel quantization-aware training (QAT) framework tailored for
high-performance and inference-efficient video DMs under extremely low-bit
quantization (e.g., 4-bit or below). We begin with a theoretical analysis
demonstrating that reducing the gradient norm is essential to facilitate
convergence for QAT. To this end, we introduce auxiliary modules (Φ) to
mitigate large quantization errors, leading to significantly enhanced
convergence. To eliminate the inference overhead of Φ, we propose a
rank-decay strategy that progressively removes Φ. Specifically, we
repeatedly apply singular value decomposition (SVD) and a proposed rank-based
regularization γ to identify and decay low-contributing
components. This strategy retains performance while zeroing out inference
overhead. Extensive experiments across 4 state-of-the-art (SOTA) video DMs,
with parameter sizes ranging from 1.3B to 14B, show that QVGen is the
first to reach full-precision comparable quality under 4-bit settings.
Moreover, it significantly outperforms existing methods. For instance, our
3-bit CogVideoX-2B achieves improvements of +25.28 in Dynamic Degree and
+8.43 in Scene Consistency on VBench.
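The abstract's two core ideas can be illustrated with a minimal numpy sketch: a low-rank auxiliary module Φ absorbs part of the quantization error of a weight matrix, and a rank-decay step repeatedly shrinks Φ until it costs nothing at inference. All details below (symmetric uniform quantization, building Φ from the SVD of the error, the fixed rank schedule) are illustrative assumptions; the paper trains Φ via QAT and selects components with its rank-based regularization γ, neither of which is reproduced here.

```python
import numpy as np

def quantize(W, bits=4):
    # Symmetric uniform quantization (illustrative; the paper's
    # exact quantization scheme may differ).
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    return np.round(W / scale).clip(-qmax, qmax) * scale

def auxiliary_module(W, Wq, rank=8):
    # Build a rank-`rank` module Phi = A @ B approximating the
    # quantization error W - Wq (hypothetical construction; in the
    # paper, Phi is learned during quantization-aware training).
    U, S, Vt = np.linalg.svd(W - Wq, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank]

def decay_rank(A, B, keep):
    # One rank-decay step: re-run SVD on Phi and keep only the top
    # `keep` components. The paper instead uses a rank-based
    # regularization (gamma) to pick low-contributing components.
    U, S, Vt = np.linalg.svd(A @ B, full_matrices=False)
    return U[:, :keep] * S[:keep], Vt[:keep]

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
Wq = quantize(W)

A, B = auxiliary_module(W, Wq, rank=8)
err_plain = np.linalg.norm(W - Wq)          # error of quantized weights alone
err_aux = np.linalg.norm(W - (Wq + A @ B))  # error with Phi attached
assert err_aux < err_plain                  # Phi reduces quantization error

# Progressively decay Phi to rank 0: no inference overhead remains.
for keep in (4, 2, 0):
    A, B = decay_rank(A, B, keep)
assert A.shape[1] == 0
```

The sketch shows why the rank-decay route is attractive: Φ helps only during training, so shrinking it to rank zero leaves the plain quantized weights with no extra matmul at inference time.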