HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives

Abstract

State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating coherent multi-shot narratives, which are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state of the art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future. Our code is available at https://holo-cine.github.io/.
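As a rough illustration of the two attention patterns named above, the following sketch builds boolean masks for a toy scene. It is not the authors' implementation. The function names, the text layout (a global prompt followed by one caption per shot), and the rule that only the first few tokens of each shot are visible to other shots are all assumptions made for this example; the abstract states only that self-attention is dense within shots and sparse between them, and that text prompts are localized to specific shots.

```python
def shot_ids(lengths):
    """Map each token position to the index of the segment (shot) it belongs to."""
    return [s for s, n in enumerate(lengths) for _ in range(n)]


def sparse_inter_shot_mask(shot_lengths, summary_tokens=1):
    """Return mask[q][k] = True where video token q may attend to video token k.

    Dense within a shot. Across shots, only the first `summary_tokens`
    tokens of each shot are visible (an ASSUMED sparsity rule, chosen
    here for illustration only).
    """
    ids = shot_ids(shot_lengths)
    # Flag key tokens that sit among the first positions of their shot.
    is_summary = [k < summary_tokens for n in shot_lengths for k in range(n)]
    size = len(ids)
    return [[ids[q] == ids[k] or is_summary[k] for k in range(size)]
            for q in range(size)]


def window_cross_attention_mask(shot_lengths, global_len, caption_lengths):
    """Return mask[q][t] = True where video token q may attend to text token t.

    ASSUMED text layout: [global prompt | caption 0 | caption 1 | ...].
    Every video token sees the global prompt; beyond that, it sees only
    the caption of its own shot.
    """
    ids = shot_ids(shot_lengths)
    # None marks global-prompt tokens; integers mark the owning shot.
    text_owner = [None] * global_len + shot_ids(caption_lengths)
    return [[owner is None or owner == s for owner in text_owner]
            for s in ids]


if __name__ == "__main__":
    # Two shots of 3 and 2 video tokens; a 2-token global prompt and
    # per-shot captions of 1 and 2 tokens.
    self_mask = sparse_inter_shot_mask([3, 2], summary_tokens=1)
    cross_mask = window_cross_attention_mask([3, 2], 2, [1, 2])
    for row in self_mask:
        print("".join("x" if allowed else "." for allowed in row))
    for row in cross_mask:
        print("".join("x" if allowed else "." for allowed in row))
```

In a real model, masks like these would be passed to the attention layers so that disallowed pairs are skipped or set to negative infinity before the softmax. The efficiency gain comes from the inter-shot blocks being mostly empty as the number of shots grows.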
