MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
December 2, 2025
Authors: Qinghe Wang, Xiaoyu Shi, Baolu Li, Weikang Bian, Quande Liu, Huchuan Lu, Xintao Wang, Pengfei Wan, Kun Gai, Xu Jia
cs.AI
Abstract
Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, a coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of rotary position embedding (RoPE). First, we introduce Multi-Shot Narrative RoPE, which applies an explicit phase shift at each shot transition, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE, which incorporates reference tokens and grounding signals to enable spatiotemporally grounded reference injection. In addition, to overcome data scarcity, we establish an automated annotation pipeline that extracts multi-shot videos, captions, cross-shot grounding signals, and reference images. Our framework leverages these intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, subject customization with motion control, and background-driven scene customization. Both shot count and shot duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
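To make the phase-shift mechanism concrete, the sketch below shows one way a multi-shot narrative RoPE could work, assuming the explicit phase shift is realized as a constant offset added to the temporal position index at every shot boundary. The function names (build_shot_positions, apply_rope) and the offset value are hypothetical illustrations under that assumption, not the paper's actual implementation.

```python
# Minimal sketch of a phase-shifted temporal RoPE for multi-shot video.
# Assumption (not from the paper): the "explicit phase shift" is a fixed
# positional offset added at each shot boundary, so shots stay in
# narrative order but are separated in rotary phase.
import torch

def build_shot_positions(frames_per_shot, shot_offset=16.0):
    """Temporal positions that jump by `shot_offset` at every shot
    transition while remaining monotonically increasing."""
    positions, t = [], 0.0
    for shot_idx, n_frames in enumerate(frames_per_shot):
        base = t + shot_idx * shot_offset  # per-shot phase shift
        positions.extend(base + i for i in range(n_frames))
        t += n_frames
    return torch.tensor(positions)

def apply_rope(x, positions, theta=10000.0):
    """Standard rotary embedding along the temporal axis.
    x: (seq_len, dim) with even dim; positions: (seq_len,)."""
    dim = x.shape[-1]
    freqs = 1.0 / theta ** (torch.arange(0, dim, 2).float() / dim)
    angles = positions[:, None] * freqs[None, :]   # (seq_len, dim // 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin           # rotate each 2-D pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Example: three shots of 8, 12, and 8 frames; shot count and lengths
# are arbitrary, mirroring the framework's flexible configuration.
pos = build_shot_positions([8, 12, 8])
q = torch.randn(len(pos), 64)
q_rot = apply_rope(q, pos)
```

Under the same reading, the spatiotemporal variant would assign each reference token the (t, h, w) position of its grounded region in the target video, so injected tokens share rotary phase with the locations they are meant to influence.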