ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation
March 12, 2026
Authors: Songlin Yang, Zhe Wang, Xuyi Yang, Songchun Zhang, Xianghao Kong, Taiyi Wu, Xiaotong Zhao, Ran Zhang, Alan Zhao, Anyi Rao
cs.AI
Abstract
Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant bottleneck. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution. Guided by this insight, we present ShotVerse, a "Plan-then-Control" framework that decouples generation into two collaborative agents: a Planner, based on a Vision-Language Model (VLM), that leverages spatial priors to obtain cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. Central to our approach is the construction of a data foundation: we design an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system. This facilitates the curation of ShotVerse-Bench, a high-fidelity cinematic dataset with a three-track evaluation protocol that serves as the bedrock for our framework. Extensive experiments demonstrate that ShotVerse effectively bridges the gap between unreliable textual control and labor-intensive manual plotting, achieving superior cinematic aesthetics and generating multi-shot videos that are both camera-accurate and cross-shot consistent.