

QVGen: Pushing the Limit of Quantized Video Generative Models

May 16, 2025
Authors: Yushi Huang, Ruihao Gong, Jing Liu, Yifu Ding, Chengtao Lv, Haotong Qin, Jun Zhang
cs.AI

Abstract

Video diffusion models (DMs) have enabled high-quality video synthesis. Yet, their substantial computational and memory demands pose serious challenges to real-world deployment, even on high-end GPUs. As a commonly adopted solution, quantization has achieved notable success in reducing cost for image DMs, yet its direct application to video DMs remains ineffective. In this paper, we present QVGen, a novel quantization-aware training (QAT) framework tailored for high-performance and inference-efficient video DMs under extremely low-bit quantization (e.g., 4-bit or below). We begin with a theoretical analysis demonstrating that reducing the gradient norm is essential to facilitate convergence for QAT. To this end, we introduce auxiliary modules (Φ) to mitigate large quantization errors, leading to significantly enhanced convergence. To eliminate the inference overhead of Φ, we propose a rank-decay strategy that progressively eliminates Φ. Specifically, we repeatedly employ singular value decomposition (SVD) and a proposed rank-based regularization γ to identify and decay low-contributing components. This strategy retains performance while zeroing out inference overhead. Extensive experiments across 4 state-of-the-art (SOTA) video DMs, with parameter sizes ranging from 1.3B to 14B, show that QVGen is the first to reach full-precision comparable quality under 4-bit settings. Moreover, it significantly outperforms existing methods. For instance, our 3-bit CogVideoX-2B achieves improvements of +25.28 in Dynamic Degree and +8.43 in Scene Consistency on VBench.
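The rank-decay idea described above can be pictured with a small sketch. The snippet below is a minimal illustration, assuming the auxiliary module Φ acts as a low-rank weight correction that is factored with SVD, its lowest-contributing components are decayed, and the retained rank shrinks over training until the correction (and its inference cost) vanishes. The function name, the keep-ratio schedule, and the use of singular values as the contribution score are illustrative assumptions, not the paper's exact regularization γ.

```python
# Hypothetical sketch of SVD-based rank decay for an auxiliary correction matrix.
import numpy as np


def rank_decay_step(aux_weight: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Return a reduced-rank version of the auxiliary correction matrix.

    aux_weight: current auxiliary (error-compensation) matrix, shape (out, in).
    keep_ratio: fraction of singular components to keep at this step (0..1).
    """
    # Factor the correction: aux_weight = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(aux_weight, full_matrices=False)

    # Score each component by its singular value (a proxy for its contribution),
    # then zero out the low-contributing tail.
    k = int(np.ceil(keep_ratio * s.size))
    s_decayed = np.where(np.arange(s.size) < k, s, 0.0)

    # Reassemble the lower-rank correction.
    return (U * s_decayed) @ Vt


# Progressively shrink the rank toward zero over a toy schedule.
rng = np.random.default_rng(0)
aux = rng.standard_normal((64, 64)) * 0.01  # toy auxiliary correction
for step, ratio in enumerate([0.75, 0.5, 0.25, 0.0]):
    aux = rank_decay_step(aux, ratio)
    print(f"decay step {step}: keep_ratio={ratio:.2f}, "
          f"effective rank={np.linalg.matrix_rank(aux)}")
```

At the final step the correction is exactly zero, so no auxiliary computation remains at inference time, which mirrors the "retains performance while zeroing out inference overhead" claim in the abstract.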
