ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

Abstract

Multi-shot video generation is crucial for long-form narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream lets users dynamically steer an ongoing narrative through streaming prompts. We achieve this by first fine-tuning a text-to-video model into a bidirectional next-shot generator, which is then distilled into a causal student model via Distribution Matching Distillation. To overcome the challenges of inter-shot consistency and error accumulation inherent in autoregressive generation, we introduce two key innovations. First, a dual-cache memory mechanism preserves visual coherence: a global context cache retains conditional frames for inter-shot consistency, while a local context cache holds the frames generated within the current shot for intra-shot consistency. In addition, a RoPE discontinuity indicator explicitly distinguishes the two caches, eliminating ambiguity between them. Second, to mitigate error accumulation, we propose a two-stage distillation strategy. It begins with intra-shot self-forcing conditioned on ground-truth historical shots and progressively extends to inter-shot self-forcing using self-generated histories, effectively bridging the train-test gap. Extensive experiments demonstrate that ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU. It matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling. Training and inference code, as well as the models, are available through our project.
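The dual-cache idea can be illustrated with a small, purely hypothetical sketch. It is not the authors' implementation: the class name `DualCache`, the capacities, the subsampling rule in `end_shot`, and the reading of the RoPE discontinuity indicator as a fixed jump (`rope_gap`) in the temporal position index are all assumptions made for illustration. Frames are represented as strings rather than latent tensors.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple


@dataclass
class DualCache:
    """Toy model of a global/local context cache (hypothetical, for illustration)."""

    global_capacity: int = 4  # assumed number of conditioning frames kept from past shots
    local_capacity: int = 8   # assumed number of frames kept from the current shot
    rope_gap: int = 16        # assumed jump in temporal position between the two caches
    global_frames: Deque[str] = field(default_factory=deque)
    local_frames: Deque[str] = field(default_factory=deque)

    def __post_init__(self) -> None:
        # Bounded deques drop their oldest entries once capacity is reached.
        self.global_frames = deque(self.global_frames, maxlen=self.global_capacity)
        self.local_frames = deque(self.local_frames, maxlen=self.local_capacity)

    def append_generated(self, frame: str) -> None:
        """Store a newly generated frame of the current shot (intra-shot context)."""
        self.local_frames.append(frame)

    def end_shot(self, keep_every: int = 2) -> None:
        """Move a subsample of the finished shot into the global cache, then reset."""
        for frame in list(self.local_frames)[::keep_every]:
            self.global_frames.append(frame)
        self.local_frames.clear()

    def context(self) -> List[Tuple[str, int]]:
        """Return (frame, temporal position) pairs for the next generation step.

        Global frames take positions 0..G-1. Local frames start after a jump of
        `rope_gap`, so the position index itself marks which cache a frame is from.
        """
        n_global = len(self.global_frames)
        local_start = n_global + self.rope_gap if n_global else 0
        pairs = [(f, i) for i, f in enumerate(self.global_frames)]
        pairs += [(f, local_start + i) for i, f in enumerate(self.local_frames)]
        return pairs


# Usage: generate one shot, switch shots, then start the next one.
cache = DualCache(global_capacity=2, local_capacity=3, rope_gap=10)
for name in ["s1f0", "s1f1", "s1f2", "s1f3"]:
    cache.append_generated(name)
cache.end_shot(keep_every=2)
cache.append_generated("s2f0")
ctx = cache.context()
```

In this toy setup the positions of the global frames are contiguous, while the first frame of the new shot sits `rope_gap` positions later, which is one simple way a positional encoding could signal the boundary between the two caches.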
