ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

Abstract

Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream lets users dynamically steer an ongoing narrative via streaming prompts. We achieve this by first fine-tuning a text-to-video model into a bidirectional next-shot generator, which is then distilled into a causal student via Distribution Matching Distillation. To overcome the challenges of inter-shot consistency and error accumulation inherent in autoregressive generation, we introduce two key innovations. First, a dual-cache memory mechanism preserves visual coherence: a global context cache retains conditional frames for inter-shot consistency, while a local context cache holds the frames generated within the current shot for intra-shot consistency. A RoPE discontinuity indicator explicitly distinguishes the two caches, eliminating ambiguity. Second, to mitigate error accumulation, we propose a two-stage distillation strategy. It begins with intra-shot self-forcing conditioned on ground-truth historical shots and progressively extends to inter-shot self-forcing using self-generated histories, effectively bridging the train-test gap. Extensive experiments demonstrate that ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU. It matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling. Training and inference code, as well as the models, are available on our
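The dual-cache idea can be illustrated with a small toy sketch. It is not the authors' implementation: the class name `DualCache`, the cache sizes, the rule for choosing which frames become conditional frames, and the use of a fixed position jump (`rope_gap`) as the discontinuity indicator are all assumptions made only for illustration, since the abstract does not specify these details.

```python
from collections import deque


class DualCache:
    """Toy dual-cache memory. Names, sizes, and the position scheme are hypothetical."""

    def __init__(self, global_capacity=4, local_capacity=8, rope_gap=100):
        # Global cache: conditional frames carried over from earlier shots.
        self.global_frames = deque(maxlen=global_capacity)
        # Local cache: frames generated so far in the current shot.
        self.local_frames = deque(maxlen=local_capacity)
        # Assumed jump in position index that marks the boundary between caches.
        self.rope_gap = rope_gap

    def append_generated(self, frame):
        """Record a newly generated frame of the current shot."""
        self.local_frames.append(frame)

    def end_shot(self, num_conditional=2):
        """Close the current shot and promote some of its frames to the global cache.

        Assumption: the most recent frames are promoted; the real selection
        rule is not given in the abstract.
        """
        kept = list(self.local_frames)[-num_conditional:] if num_conditional > 0 else []
        self.global_frames.extend(kept)
        self.local_frames.clear()

    def context(self):
        """Return (source, position, frame) triples for the next generation step."""
        entries = [("global", i, f) for i, f in enumerate(self.global_frames)]
        # Local positions start after a gap, so the two caches never share
        # a contiguous position range.
        offset = len(self.global_frames) + self.rope_gap
        entries += [("local", offset + j, f) for j, f in enumerate(self.local_frames)]
        return entries


cache = DualCache(global_capacity=2, local_capacity=3, rope_gap=10)
for frame in ["a0", "a1", "a2", "a3"]:
    cache.append_generated(frame)
cache.end_shot(num_conditional=2)   # shot A ends; two frames become conditional
cache.append_generated("b0")        # first frame of shot B
print(cache.context())
```

In this sketch the position gap plays the role of the discontinuity indicator: a model that reads the positions can tell that frames from earlier shots and frames from the current shot are not temporally adjacent, even though both sit in one context.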
