

Captain Cinema: Towards Short Movie Generation

July 24, 2025
作者: Junfei Xiao, Ceyuan Yang, Lvmin Zhang, Shengqu Cai, Yang Zhao, Yuwei Guo, Gordon Wetzstein, Maneesh Agrawala, Alan Yuille, Lu Jiang
cs.AI

Abstract
We present Captain Cinema, a generation framework for short movie generation. Given a detailed textual description of a movie storyline, our approach first generates a sequence of keyframes that outline the entire narrative, ensuring long-range coherence in both the storyline and the visual appearance (e.g., scenes and characters). We refer to this step as top-down keyframe planning. These keyframes then serve as conditioning signals for a video synthesis model that supports long-context learning, producing the spatio-temporal dynamics between them. This step is referred to as bottom-up video synthesis. To support stable and efficient generation of multi-scene, long narrative cinematic works, we introduce an interleaved training strategy for Multimodal Diffusion Transformers (MM-DiT), specifically adapted for long-context video data. Our model is trained on a specially curated cinematic dataset consisting of interleaved data pairs. Our experiments demonstrate that Captain Cinema performs favorably in the automated creation of visually coherent and narrative-consistent short movies with high quality and efficiency. Project page: https://thecinema.ai
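The two-stage pipeline described in the abstract can be sketched as a simple driver loop: a top-down planner turns the storyline into an ordered list of keyframes, and a bottom-up synthesizer fills in the motion between consecutive keyframe pairs. The sketch below is purely illustrative: the stub functions, data types, and scene-splitting heuristic are assumptions for exposition, not the paper's actual models or interfaces.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Keyframe:
    index: int
    scene: str
    prompt: str

def plan_keyframes(storyline: str) -> List[Keyframe]:
    """Top-down keyframe planning (stub): split the storyline into
    scene-level prompts and emit one keyframe per scene. In the paper
    this is done by a generative model, not a sentence splitter."""
    scenes = [s.strip() for s in storyline.split(".") if s.strip()]
    return [Keyframe(i, f"scene_{i}", text) for i, text in enumerate(scenes)]

def synthesize_clip(start: Keyframe, end: Keyframe) -> str:
    """Bottom-up video synthesis (stub): stands in for a long-context
    video model generating the spatio-temporal dynamics between two
    conditioning keyframes."""
    return f"clip[{start.index}->{end.index}]"

def generate_short_movie(storyline: str) -> List[str]:
    """Run the full pipeline: plan keyframes, then condition each clip
    on consecutive keyframe pairs so adjacent clips stay coherent."""
    keyframes = plan_keyframes(storyline)
    return [synthesize_clip(a, b) for a, b in zip(keyframes, keyframes[1:])]

print(generate_short_movie("A hero appears. A storm hits. The hero wins."))
```

A three-sentence storyline yields three keyframes and two bridging clips; the real system replaces both stubs with an MM-DiT trained on interleaved keyframe/video pairs.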