Stream-T1: Test-Time Scaling for Streaming Video Generation

Abstract

Test-Time Scaling (TTS) offers a promising way to enhance video generation without the surging cost of additional training. However, current test-time video generation methods based on diffusion models suffer from exorbitant candidate exploration costs and lack temporal guidance. To address these structural bottlenecks, we propose shifting the focus to streaming video generation. We identify that its chunk-level synthesis and small number of denoising steps are intrinsically suited to TTS, significantly lowering computational overhead while enabling fine-grained temporal control. Driven by this insight, we introduce Stream-T1, a pioneering, comprehensive TTS framework tailored specifically to streaming video generation. Stream-T1 is composed of three units: (1) **Stream-Scaled Noise Propagation**, which actively refines the initial latent noise of the chunk being generated using the noise of previous chunks that has already proven to yield high quality, effectively establishing temporal dependency and using the historical Gaussian prior to guide the current generation; (2) **Stream-Scaled Reward Pruning**, which comprehensively evaluates generated candidates and strikes a balance between local spatial aesthetics and global temporal coherence by integrating immediate short-term assessments with sliding-window-based long-term evaluations; (3) **Stream-Scaled Memory Sinking**, which dynamically routes the context evicted from the KV cache into distinct update pathways guided by reward feedback, ensuring that previously generated visual information effectively anchors and guides the subsequent video stream. Evaluated on comprehensive 5-second and 30-second video benchmarks, Stream-T1 significantly improves temporal consistency, motion smoothness, and frame-level visual quality.
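The abstract gives no formulas for the three units, so the following is only a minimal sketch of how the reward pruning idea could look, not the authors' method. It assumes a weighted sum of a short-term reward on the candidate chunk and a long-term reward on a sliding window of accepted chunks. The names `prune_candidates`, `generate_stream`, `short_reward`, `long_reward`, `sample_candidates`, and the weight `alpha` are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, TypeVar

Chunk = TypeVar("Chunk")


def prune_candidates(
    candidates: Sequence[Chunk],
    history: Deque[Chunk],
    short_reward: Callable[[Chunk], float],
    long_reward: Callable[[List[Chunk]], float],
    alpha: float = 0.5,
) -> Chunk:
    """Keep the candidate with the highest blended score.

    The blend (a weighted sum controlled by `alpha`) is an assumption made
    for illustration; the abstract does not specify the combination rule.
    """
    if not candidates:
        raise ValueError("need at least one candidate")

    def score(candidate: Chunk) -> float:
        # Long-term view: previously accepted chunks plus this candidate.
        window = list(history) + [candidate]
        return alpha * short_reward(candidate) + (1.0 - alpha) * long_reward(window)

    return max(candidates, key=score)


def generate_stream(
    sample_candidates: Callable[[int], Sequence[Chunk]],
    num_chunks: int,
    window_size: int,
    short_reward: Callable[[Chunk], float],
    long_reward: Callable[[List[Chunk]], float],
    alpha: float = 0.5,
) -> List[Chunk]:
    """Build a stream chunk by chunk, pruning candidates at every step."""
    # deque(maxlen=...) silently drops the oldest chunk once the window is full.
    history: Deque[Chunk] = deque(maxlen=window_size)
    video: List[Chunk] = []
    for step in range(num_chunks):
        best = prune_candidates(
            sample_candidates(step), history, short_reward, long_reward, alpha
        )
        history.append(best)
        video.append(best)
    return video
```

In this sketch a chunk can be any object; in a real system `sample_candidates` would wrap the streaming generator and the two reward callables would wrap learned reward models. Setting `alpha` to 1.0 reduces the rule to per-chunk best-of-N selection with no temporal term.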

