Speculative Decoding for Autoregressive Video Generation

Abstract

Autoregressive video diffusion is emerging as a promising paradigm for streaming video synthesis, with step distillation serving as the primary means of accelerating inference. Whether speculative decoding, the dominant acceleration strategy for large language models, can be effectively adapted to autoregressive video generation remains an open question, because video blocks are continuous spatiotemporal tensors with no token-level distribution for exact rejection sampling. We introduce SDVG, which brings speculative decoding to block-based autoregressive video diffusion by replacing token verification with an image-quality router. A 1.3B-parameter drafter proposes candidate blocks via four denoising steps; each block is VAE-decoded and scored by ImageReward using worst-frame aggregation, that is, taking the minimum per-frame reward so that single-frame artifacts that averaging would mask are caught. Blocks scoring above a fixed threshold τ are accepted into the 14B-parameter target's KV cache; the rest are regenerated by the target. Two additional design choices prove critical: the first block is always force-rejected to anchor scene composition, and τ serves as a single knob that traces a smooth quality-speed Pareto frontier. On 1003 MovieGenVideoBench prompts (832x480), SDVG retains 98.1% of target-only VisionReward quality (0.0773 vs. 0.0788) at a 1.59x speedup with τ = -0.7, and reaches a 2.09x speedup at 95.7% quality retention, while consistently outperforming draft-only generation by more than 17%. The framework is training-free, requires no architectural changes, and can be seamlessly integrated into existing autoregressive video generation pipelines.
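
The routing rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names generate_with_router, draft_block, target_block and frame_rewards are hypothetical stand-ins for the 1.3B drafter, the 14B target, and the VAE decode plus ImageReward scoring step; the list of accepted blocks stands in for the target's KV cache. Two details are assumptions: the drafter is skipped entirely for the first block (the abstract only says that block is force-rejected), and a score exactly equal to the threshold is treated as a rejection.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Block = TypeVar("Block")  # placeholder for a video block (a latent tensor in practice)


def worst_frame_score(frame_rewards: Sequence[float]) -> float:
    """Worst-frame aggregation: a block is only as good as its worst frame."""
    if len(frame_rewards) == 0:
        raise ValueError("a block must contain at least one frame reward")
    return min(frame_rewards)


def generate_with_router(
    num_blocks: int,
    draft_block: Callable[[List[Block]], Block],    # hypothetical: drafter conditioned on context
    target_block: Callable[[List[Block]], Block],   # hypothetical: target conditioned on context
    frame_rewards: Callable[[Block], Sequence[float]],  # hypothetical: per-frame reward scores
    tau: float,
) -> Tuple[List[Block], List[str]]:
    """Return the generated blocks and, for each block, which model produced it."""
    context: List[Block] = []  # stands in for the target model's KV cache
    sources: List[str] = []
    for index in range(num_blocks):
        if index == 0:
            # Force-reject the first block: the target always generates it.
            block, source = target_block(context), "target"
        else:
            candidate = draft_block(context)
            if worst_frame_score(frame_rewards(candidate)) > tau:
                block, source = candidate, "draft"  # accept the drafter's block
            else:
                block, source = target_block(context), "target"  # regenerate
        context.append(block)
        sources.append(source)
    return context, sources
```

As a toy check, treat a block as the list of its own per-frame rewards. With the threshold at -0.7, a draft block with rewards [0.2, -0.9, 0.5] is rejected because its worst frame scores -0.9, even though its mean reward of about -0.07 would pass the same threshold. A draft block with rewards [0.1, -0.3, 0.4] is accepted.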
