ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling
March 26, 2026
Authors: Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, Tianfan Xue
cs.AI
Abstract
Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream allows users to dynamically instruct ongoing narratives via streaming prompts. We achieve this by first fine-tuning a text-to-video model into a bidirectional next-shot generator, which is then distilled into a causal student via Distribution Matching Distillation. To overcome the inter-shot consistency and error-accumulation challenges inherent in autoregressive generation, we introduce two key innovations. First, a dual-cache memory mechanism preserves visual coherence: a global context cache retains conditional frames for inter-shot consistency, while a local context cache holds the frames generated so far within the current shot for intra-shot consistency; a RoPE discontinuity indicator explicitly distinguishes the two caches to eliminate ambiguity. Second, to mitigate error accumulation, we propose a two-stage distillation strategy that begins with intra-shot self-forcing conditioned on ground-truth historical shots and progressively extends to inter-shot self-forcing on self-generated histories, effectively bridging the train-test gap. Extensive experiments demonstrate that ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU, and matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling. Training and inference code, as well as the models, are available on our project page.
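To make the dual-cache idea concrete, the following is a minimal, hypothetical sketch under our own assumptions (the names DualCacheMemory, ROPE_GAP, and the keep parameter are illustrative and not the authors' implementation): a global cache holds conditional frames carried across shots, a local cache holds frames generated within the current shot, and a large offset in the RoPE position indices plays the role of the discontinuity indicator that lets attention distinguish the two.

```python
# Sketch only: illustrates the dual-cache memory described in the abstract,
# not the authors' actual code. Names and the ROPE_GAP value are assumptions.
from dataclasses import dataclass, field
from typing import List, Tuple

import torch

ROPE_GAP = 1_000  # assumed positional offset marking the shot boundary ("discontinuity indicator")


@dataclass
class DualCacheMemory:
    """Keeps two frame caches with distinct RoPE index ranges."""
    global_cache: List[torch.Tensor] = field(default_factory=list)  # conditional frames from earlier shots
    local_cache: List[torch.Tensor] = field(default_factory=list)   # frames generated so far in the current shot

    def add_generated_frame(self, frame: torch.Tensor) -> None:
        # Frames generated within the ongoing shot support intra-shot consistency.
        self.local_cache.append(frame)

    def finish_shot(self, keep: int = 4) -> None:
        # Promote a few frames of the finished shot to the global cache
        # (inter-shot consistency), then reset the local cache.
        self.global_cache.extend(self.local_cache[-keep:])
        self.local_cache.clear()

    def context(self) -> Tuple[torch.Tensor, torch.Tensor]:
        # Concatenate both caches and build RoPE position ids; the large gap
        # between the two index ranges acts as the discontinuity indicator
        # that disambiguates global from local context during attention.
        frames = self.global_cache + self.local_cache
        ctx = torch.stack(frames) if frames else torch.empty(0)
        pos_global = torch.arange(len(self.global_cache))
        pos_local = ROPE_GAP + torch.arange(len(self.local_cache))
        return ctx, torch.cat([pos_global, pos_local])


if __name__ == "__main__":
    mem = DualCacheMemory()
    for _ in range(8):                                  # pretend we streamed 8 frames of shot 1
        mem.add_generated_frame(torch.randn(4, 64, 64))
    mem.finish_shot()                                   # keep the last 4 frames as global context
    mem.add_generated_frame(torch.randn(4, 64, 64))     # first frame of shot 2
    ctx, pos = mem.context()
    print(ctx.shape, pos.tolist())                      # 5 cached frames; positions [0, 1, 2, 3, 1000]
```

The sketch only models the cache bookkeeping; how the cached frames and position ids are consumed by the causal generator, and how the two-stage (intra-shot then inter-shot) self-forcing distillation is trained, are described in the paper itself.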