ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling
March 26, 2026
Authors: Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, Tianfan Xue
cs.AI
Abstract
Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream allows users to dynamically steer ongoing narratives via streaming prompts. We achieve this by first fine-tuning a text-to-video model into a bidirectional next-shot generator, which is then distilled into a causal student via Distribution Matching Distillation. To overcome the challenges of inter-shot consistency and error accumulation inherent in autoregressive generation, we introduce two key innovations. First, a dual-cache memory mechanism preserves visual coherence: a global context cache retains conditional frames for inter-shot consistency, while a local context cache holds frames generated within the current shot for intra-shot consistency; a RoPE discontinuity indicator explicitly separates the two caches to eliminate ambiguity. Second, to mitigate error accumulation, we propose a two-stage distillation strategy that begins with intra-shot self-forcing conditioned on ground-truth historical shots and progressively extends to inter-shot self-forcing using self-generated histories, effectively bridging the train-test gap. Extensive experiments demonstrate that ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU. It matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling. Training and inference code, as well as the models, are publicly available.
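The dual-cache mechanism described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the class name `DualCache`, the `rope_gap` offset, and the `finish_shot` promotion policy are all assumptions made for clarity.

```python
class DualCache:
    """Illustrative sketch (not the paper's code) of a dual-cache memory:
    a global context cache holds conditional frames from past shots
    (inter-shot consistency), a local context cache holds frames generated
    within the current shot (intra-shot consistency)."""

    def __init__(self, rope_gap=1000):
        self.global_cache = []    # frames promoted from finished shots
        self.local_cache = []     # frames of the shot being generated
        self.rope_gap = rope_gap  # assumed RoPE discontinuity between caches

    def add_local(self, frame):
        self.local_cache.append(frame)

    def finish_shot(self, n_keep=2):
        # Promote the last few frames of the finished shot into the global
        # cache as conditioning context, then reset the local cache.
        self.global_cache.extend(self.local_cache[-n_keep:])
        self.local_cache = []

    def rope_positions(self):
        # Global frames sit at contiguous positions; local frames start after
        # a large gap, so attention can tell the two contexts apart.
        g = list(range(len(self.global_cache)))
        start = len(self.global_cache) + self.rope_gap
        return g + list(range(start, start + len(self.local_cache)))
```

The position gap plays the role of the RoPE discontinuity indicator: rather than a learned flag, this sketch simply offsets the local cache's rotary positions so the two contexts are unambiguous to the attention layers.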
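The two-stage self-forcing schedule can likewise be sketched as a rule for choosing the conditioning history. The function below is a hypothetical skeleton: `student_generate` and the shot representation are placeholders, not the paper's API.

```python
def self_forcing_histories(gt_shots, student_generate, stage):
    """Return the conditioning history for next-shot prediction.

    Stage 1 (intra-shot self-forcing): the student conditions on
    ground-truth previous shots while rolling out the current shot.
    Stage 2 (inter-shot self-forcing): previous shots are the student's
    own generations, so training matches the autoregressive test-time
    setting and the train-test gap closes.

    `student_generate(history, gt_shot)` is an assumed placeholder that
    re-generates one shot given the accumulated history.
    """
    if stage == 1:
        return gt_shots[:-1]              # ground-truth history
    history = []
    for gt_shot in gt_shots[:-1]:         # rebuild history shot by shot
        history.append(student_generate(history, gt_shot))
    return history
```

Progressing from stage 1 to stage 2 exposes the student to its own accumulated errors in the conditioning context, which is the mechanism the abstract credits with mitigating error accumulation.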