ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

Abstract

Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream lets users steer an ongoing narrative dynamically via streaming prompts. We achieve this by first fine-tuning a text-to-video model into a bidirectional next-shot generator, which is then distilled into a causal student via Distribution Matching Distillation. To overcome the challenges of inter-shot consistency and error accumulation inherent in autoregressive generation, we introduce two key innovations. First, a dual-cache memory mechanism preserves visual coherence: a global context cache retains conditional frames for inter-shot consistency, while a local context cache holds generated frames within the current shot for intra-shot consistency. A RoPE discontinuity indicator explicitly distinguishes the two caches, eliminating ambiguity. Second, to mitigate error accumulation, we propose a two-stage distillation strategy. It begins with intra-shot self-forcing conditioned on ground-truth historical shots and progressively extends to inter-shot self-forcing using self-generated histories, effectively bridging the train-test gap. Extensive experiments demonstrate that ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU. It matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling. Training and inference code, as well as the models, are available on our
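
The dual-cache idea described above can be illustrated with a small toy sketch. It is not the authors' implementation: the abstract does not say how conditional frames are chosen, how large the caches are, or how the RoPE discontinuity is realized. The class name `DualCache`, the cache sizes, the even subsampling in `end_shot`, and the integer `position_gap` standing in for the RoPE discontinuity are all assumptions made purely for illustration.

```python
from collections import deque


class DualCache:
    """Toy dual-cache memory for next-shot generation (illustrative only)."""

    def __init__(self, max_global=8, max_local=16, position_gap=1000):
        # All three values are hypothetical; the abstract gives no sizes.
        self.global_frames = deque(maxlen=max_global)  # conditional frames from earlier shots
        self.local_frames = deque(maxlen=max_local)    # generated frames of the current shot
        self.position_gap = position_gap               # stand-in for the RoPE discontinuity

    def add_generated(self, frame):
        """Store a newly generated frame of the current shot."""
        self.local_frames.append(frame)

    def end_shot(self, num_conditional=2):
        """Close the current shot and promote a few of its frames to the global cache.

        Even subsampling is an assumption; the source does not specify
        how conditional frames are selected.
        """
        frames = list(self.local_frames)
        step = max(1, len(frames) // max(1, num_conditional))
        for frame in frames[::step][:num_conditional]:
            self.global_frames.append(frame)
        self.local_frames.clear()

    def context(self):
        """Return (frame, position) pairs, global cache first.

        Local positions start after a deliberate jump so the two caches
        occupy clearly separated position ranges.
        """
        global_part = [(f, i) for i, f in enumerate(self.global_frames)]
        start = len(self.global_frames) + self.position_gap
        local_part = [(f, start + i) for i, f in enumerate(self.local_frames)]
        return global_part + local_part


cache = DualCache(max_global=4, max_local=8, position_gap=100)
for i in range(4):
    cache.add_generated(f"shot0_frame{i}")
cache.end_shot(num_conditional=2)       # first shot ends; two frames become global context
cache.add_generated("shot1_frame0")     # second shot starts streaming
cache.add_generated("shot1_frame1")
context = cache.context()
```

In this sketch the jump in position indices between the last global entry and the first local entry plays the role of the discontinuity indicator: a model attending over `context` could tell from positions alone which entries are history and which belong to the shot in progress.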
