HoloCine：电影级多镜头长视频叙事的整体生成

摘要

当前最先进的文生视频模型虽能生成独立片段，却在构建连贯多镜头叙事——这一叙事艺术的核心要素——方面存在不足。我们通过HoloCine模型弥合了这种"叙事鸿沟"，该模型能够整体生成完整场景，确保从开场到结尾的全局一致性。我们的架构通过窗口交叉注意力机制将文本提示精准定位到特定镜头，实现精确的导演控制；同时采用稀疏镜头间自注意力模式（镜头内稠密连接，镜头间稀疏连接），确保分钟级生成所需的效率。除在叙事连贯性上树立新标杆外，HoloCine还展现出显著涌现能力：对角色与场景的持久记忆，以及对电影技法的直觉把握。我们的工作标志着从片段合成到自动化电影制作的关键转变，使端到端的电影创作成为可触及的未来。代码已开源：https://holo-cine.github.io/。

English

State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives, which are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state-of-the-art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future. Our code is available at: https://holo-cine.github.io/.