

HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives

October 23, 2025
Authors: Yihao Meng, Hao Ouyang, Yue Yu, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Hanlin Wang, Yixuan Li, Cheng Chen, Yanhong Zeng, Yujun Shen, Huamin Qu
cs.AI

Abstract

State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating coherent, multi-shot narratives, which are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state-of-the-art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future. Our code is available at: https://holo-cine.github.io/.
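The attention pattern described above (dense within each shot, sparse between shots) can be pictured as a block-structured attention mask. The sketch below is purely illustrative and not the authors' implementation: it assumes tokens are grouped contiguously by shot and, as a hypothetical sparse rule, lets each query attend to only the first `link_tokens` tokens of every other shot; the paper's actual inter-shot pattern may differ.

```python
import numpy as np

def inter_shot_mask(shot_lengths, link_tokens=1):
    """Boolean attention mask: dense intra-shot, sparse inter-shot.

    `link_tokens` is an illustrative assumption: across shots, each
    query may attend only to the first few tokens of every other shot.
    """
    total = sum(shot_lengths)
    mask = np.zeros((total, total), dtype=bool)
    starts = np.cumsum([0] + list(shot_lengths))
    for i, li in enumerate(shot_lengths):
        a, b = starts[i], starts[i] + li
        mask[a:b, a:b] = True  # dense attention within the same shot
        for j, lj in enumerate(shot_lengths):
            if j != i:
                c = starts[j]
                # sparse cross-shot links to a handful of tokens per shot
                mask[a:b, c:c + min(link_tokens, lj)] = True
    return mask

# Three shots of 3, 3, and 2 tokens: an 8x8 mask where each row sees its
# own shot fully, plus one "link" token from each of the other shots.
m = inter_shot_mask([3, 3, 2])
```

Such a mask could be passed to a standard attention implementation (e.g. as `attn_mask` in `torch.nn.functional.scaled_dot_product_attention`) to realize the dense-within / sparse-between trade-off the abstract describes.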