Q-ARVD: Quantization for Autoregressive Video Diffusion Models

Abstract

Autoregressive video diffusion models (ARVDs) have emerged as a promising architecture for streaming video generation, paving the way for real-time interactive video generation and world modeling. Despite this potential, their substantial inference cost remains a major obstacle to practical deployment, which makes model quantization a natural direction for improving efficiency. However, quantization for ARVDs remains largely unexplored. Our empirical analysis shows that directly applying existing quantization schemes developed for standard diffusion transformers to ARVDs yields suboptimal performance and reveals quantization behaviors that differ from those observed in bidirectional diffusion models. In this paper, we identify two critical challenges in quantizing ARVDs: (C1) Highly unbalanced frame-wise quantization sensitivity. Error accumulation during autoregressive generation can induce severely skewed quantization sensitivity across frames, following an exponential-like decay pattern. (C2) Prominent and heterogeneous outlier patterns in weights. Weight distributions exhibit pronounced outlier channels whose patterns vary substantially across layer types and block depths. To address these issues, we propose Q-ARVD, a novel framework for accurate ARVD quantization. (S1) To tackle the highly unbalanced frame-wise sensitivity, Q-ARVD incorporates a final-quality-aware frame-weighting mechanism into the quantization objective. (S2) To prevent heterogeneous outliers from degrading performance, Q-ARVD introduces an outlier-aware adaptive dual-scale quantization scheme, which automatically detects the presence and number of outlier channels in an arbitrary layer and isolates them to protect the normal channels. Extensive experiments demonstrate the superiority of Q-ARVD.
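To make the idea behind (S1) concrete, the following is a minimal sketch of a frame-weighted quantization objective. It is an illustration under stated assumptions, not the Q-ARVD formulation: the abstract does not specify how the frame weights are derived, so the exponential weighting over the frame index, the `decay_rate` parameter, and all function names here are hypothetical.

```python
import math


def exponential_frame_weights(num_frames, decay_rate):
    """Hypothetical weights proportional to exp(-decay_rate * f), normalized to sum to 1.

    Assumption: sensitivity decays with the frame index f. The abstract only
    states that sensitivity follows an exponential-like decay pattern.
    """
    raw = [math.exp(-decay_rate * f) for f in range(num_frames)]
    total = sum(raw)
    return [r / total for r in raw]


def frame_mse(reference, quantized):
    """Per-frame mean squared error; each frame is a flat list of floats."""
    errors = []
    for ref_frame, q_frame in zip(reference, quantized):
        squared = [(r - q) ** 2 for r, q in zip(ref_frame, q_frame)]
        errors.append(sum(squared) / len(squared))
    return errors


def weighted_quantization_loss(reference, quantized, weights):
    """Weighted sum of per-frame errors between full-precision and quantized outputs."""
    errors = frame_mse(reference, quantized)
    return sum(w * e for w, e in zip(weights, errors))


if __name__ == "__main__":
    # Toy outputs for two frames, two values each (illustrative numbers only).
    reference = [[0.0, 0.0], [0.0, 0.0]]
    quantized = [[1.0, 1.0], [2.0, 2.0]]
    weights = exponential_frame_weights(num_frames=2, decay_rate=1.0)
    print(weighted_quantization_loss(reference, quantized, weights))
```

With `decay_rate=0.0` the weights are uniform and the loss reduces to the plain mean of the per-frame errors, which is the unweighted baseline this kind of objective is meant to improve on.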
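The dual-scale idea in (S2) can likewise be sketched in a few lines. This is a simplified illustration, not the Q-ARVD algorithm: the detection rule (a channel counts as an outlier when its maximum magnitude exceeds `ratio` times the median channel magnitude), the symmetric uniform quantizer, and every function and parameter name are assumptions introduced here. The abstract states only that outlier channels are detected automatically and isolated to protect the normal channels.

```python
def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return 0.5 * (ordered[mid - 1] + ordered[mid])


def detect_outlier_channels(weights, ratio=4.0):
    """Indices of channels (rows) whose max magnitude exceeds ratio * median.

    Hypothetical rule; it returns an empty list when no channel stands out,
    so both the presence and the number of outlier channels are data-driven.
    """
    magnitudes = [max(abs(v) for v in row) for row in weights]
    med = median(magnitudes)
    return [i for i, m in enumerate(magnitudes) if med > 0 and m > ratio * med]


def quantize_group(rows, num_bits):
    """Symmetric uniform fake-quantization of a group of rows with one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1
    absmax = max((abs(v) for row in rows for v in row), default=0.0)
    scale = absmax / qmax if absmax > 0 else 1.0
    dequantized = [
        [max(-qmax, min(qmax, round(v / scale))) * scale for v in row]
        for row in rows
    ]
    return dequantized, scale


def dual_scale_quantize(weights, num_bits=4, ratio=4.0):
    """Quantize normal and outlier channels separately, each with its own scale."""
    outlier_idx = detect_outlier_channels(weights, ratio)
    outlier_set = set(outlier_idx)
    normal_idx = [i for i in range(len(weights)) if i not in outlier_set]
    normal_deq, normal_scale = quantize_group([weights[i] for i in normal_idx], num_bits)
    outlier_deq, outlier_scale = quantize_group([weights[i] for i in outlier_idx], num_bits)
    result = [None] * len(weights)
    for i, row in zip(normal_idx, normal_deq):
        result[i] = row
    for i, row in zip(outlier_idx, outlier_deq):
        result[i] = row
    return result, outlier_idx, (normal_scale, outlier_scale)


if __name__ == "__main__":
    # Toy weight matrix: row 2 is an outlier channel (illustrative numbers only).
    weights = [
        [0.1, -0.2, 0.05],
        [0.15, 0.1, -0.1],
        [8.0, -6.0, 7.0],
        [0.2, -0.1, 0.1],
    ]
    _, outliers, scales = dual_scale_quantize(weights)
    print(outliers, scales)
```

The design point is that a single shared scale would be set by the outlier channel, collapsing the small weights of the normal channels toward zero at low bit widths. Giving the outlier channels their own scale keeps the step size of the normal channels small.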