CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

Abstract

Autoregressive video generation aims at real-time, open-ended synthesis. Yet, cinematic storytelling is not merely the endless extension of a single scene; it requires progressing through evolving events, viewpoint shifts, and discrete shot boundaries. Existing autoregressive models often struggle in this setting. Trained primarily for short-horizon continuation, they treat long sequences as extended single shots, inevitably suffering from motion stagnation and semantic drift during long rollouts. To bridge this gap, we introduce CausalCine, an interactive autoregressive framework that transforms multi-shot video generation into an online directing process. CausalCine generates causally across shot changes, accepts dynamic prompts on the fly, and reuses context without regenerating previous shots. To achieve this, we first train a causal base model on native multi-shot sequences to learn complex shot transitions prior to acceleration. We then propose Content-Aware Memory Routing (CAMR), which dynamically retrieves historical KV entries according to attention-based relevance scores rather than temporal proximity, preserving cross-shot coherence under bounded active memory. Finally, we distill the causal base model into a few-step generator for real-time interactive generation. Extensive experiments demonstrate that CausalCine significantly outperforms autoregressive baselines and approaches the capability of bidirectional models while unlocking the streaming interactivity of causal generation. Demo available at https://yihao-meng.github.io/CausalCine/
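The abstract describes CAMR only at a high level, so the following is a minimal sketch of the general idea of content-based retrieval under a fixed memory budget, not the authors' implementation. It assumes each past KV block is summarized by a single key vector, that relevance is a scaled dot product with the current query, and that the top-scoring blocks are kept. The function names (`relevance_scores`, `route_by_content`, `route_by_recency`) and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def relevance_scores(query: Sequence[float],
                     block_keys: Sequence[Sequence[float]]) -> List[float]:
    """Scaled dot-product score between the query and each block's summary key."""
    scale = 1.0 / math.sqrt(len(query))
    return [scale * sum(q * k for q, k in zip(query, key)) for key in block_keys]


def route_by_content(query: Sequence[float],
                     block_keys: Sequence[Sequence[float]],
                     budget: int) -> List[int]:
    """Indices of the `budget` highest-scoring blocks, returned in temporal order."""
    scores = relevance_scores(query, block_keys)
    ranked = sorted(range(len(block_keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def route_by_recency(num_blocks: int, budget: int) -> List[int]:
    """Baseline for comparison: keep only the most recent `budget` blocks."""
    return list(range(max(0, num_blocks - budget), num_blocks))


# Toy history (assumed data): block 0 comes from shot 1, blocks 1-3 from later shots.
block_keys = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
# Query for a new shot that returns to the content of shot 1.
query = [0.9, 0.0, 0.3, 0.0]

selected = route_by_content(query, block_keys, budget=2)
recent = route_by_recency(len(block_keys), budget=2)
print(selected, recent)  # prints "[0, 3] [2, 3]"
```

In this toy case the content-based router keeps block 0 from the earliest shot, while the recency window drops it. That is the contrast the abstract draws between relevance scores and temporal proximity.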
