AdaState: Self-Evolving Anchors for Streaming Video Generation

Abstract

Autoregressive video diffusion models generate streaming video by producing frames sequentially in chunks, conditioning each chunk on previously generated content. These models are structurally anchored to the first frame: its key-value representation occupies a privileged position in the attention cache and serves as the primary scene reference throughout generation. As the cleanest position in the cache, the one with the least accumulated error, this anchor draws disproportionate attention, which suppresses video dynamics and locks scene composition to the initial viewpoint even when the scene should naturally evolve. The result is temporally shallow video in which motion, camera movement, and scene progression are dampened in favor of static consistency. To address this, we replace the static anchor with an adaptive state: a hidden latent that the model denoises alongside the content at every chunk but never renders. Rather than referencing a frozen first frame, the model generates its own scene anchor at each step by attending to both the previous state and the current content, producing a reference that evolves with the generated video. Unlike standard video generation, which encodes an absolute notion of time, our formulation treats time as relative: every generation step sees the same positional structure regardless of how far generation has progressed, and the state transition is identical at every chunk. Together, these properties introduce a recurrence into the generation process in which denoising serves as the transition function and the KV cache serves as the carrier, so no external module is required. Experiments show that the adaptive state substantially improves video dynamics, enabling richer motion and more natural scene progression in the generated videos.
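
The recurrence described above can be illustrated with a minimal toy sketch. It is not the paper's implementation: the function names (`attend`, `toy_denoise`, `generate`), the vector latents, and the use of plain softmax attention as a stand-in for the denoiser are all assumptions made for illustration, and the KV cache is simplified to a single carried state vector. The sketch only shows the control flow: a hidden state is denoised jointly with the content at every chunk, only the content is emitted, and every chunk sees the same three relative slots (previous state, new state, content) no matter how many chunks came before.

```python
import math
import random


def attend(query, context):
    """Softmax attention of one query vector over a list of context vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, vec)) / scale for vec in context]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * vec[i] for w, vec in zip(weights, context)) / total
        for i in range(len(query))
    ]


def toy_denoise(prev_state, noisy_state, noisy_content, steps=4):
    """Hypothetical stand-in for the denoiser (assumption, not the real model).

    The new state and the content are refined jointly. Both attend to the
    same relative slots: [previous state, new state, content].
    """
    state, content = list(noisy_state), list(noisy_content)
    for _ in range(steps):
        context = [prev_state, state, content]
        state, content = attend(state, context), attend(content, context)
    return state, content


def generate(num_chunks, dim=4, seed=0):
    """Run the same state transition at every chunk; render only the content."""
    rng = random.Random(seed)

    def noise():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    state = noise()  # initial hidden state (assumed random here)
    rendered = []
    for _ in range(num_chunks):
        # No absolute chunk index is passed in: the transition is identical
        # at every step, which is what makes the process a recurrence.
        state, content = toy_denoise(state, noise(), noise())
        rendered.append(content)  # the state itself is never rendered
    return rendered, state
```

Calling `generate(3)` returns three content vectors and the final hidden state. Because `toy_denoise` receives no chunk counter, the loop could in principle run for any number of chunks with the same transition, which is the property the abstract attributes to treating time as relative.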