ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation

Abstract

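Text-driven video generation has greatly lowered the barrier to filmmaking, but camera control remains a key bottleneck in cinematic multi-shot scenarios. Implicit text prompts lack precision, while explicit trajectory annotation not only incurs very high manual cost but also tends to trigger execution failures in existing models. To break through this bottleneck, we propose a data-centric paradigm shift, arguing that aligned (caption, trajectory, video) triplets form an inherent joint distribution that can link automated shot planning with precise execution. Building on this insight, we introduce the ShotVerse framework, which adopts a "Plan-then-Control" architecture of two collaborating agents: a vision-language-model-based Planner uses spatial priors to generate cinematic, globally aligned trajectories from text, and a Controller renders those trajectories into multi-shot video through a camera adapter. The core of the method is the construction of its data foundation: we design an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system. The resulting high-fidelity cinematic dataset, ShotVerse-Bench, comes with a three-track evaluation protocol and serves as the cornerstone of the framework. Extensive experiments show that ShotVerse effectively bridges the gap between unreliable text-based control and costly manual planning, producing multi-shot videos that combine accurate camera motion with cross-shot consistency while achieving superior cinematic aesthetics.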

English

Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant bottleneck. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution. Guided by this insight, we present ShotVerse, a "Plan-then-Control" framework that decouples generation into two collaborative agents: a VLM (Vision-Language Model)-based Planner that leverages spatial priors to obtain cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. Central to our approach is the construction of a data foundation: we design an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system. This facilitates the curation of ShotVerse-Bench, a high-fidelity cinematic dataset with a three-track evaluation protocol that serves as the bedrock for our framework. Extensive experiments demonstrate that ShotVerse effectively bridges the gap between unreliable textual control and labor-intensive manual plotting, achieving superior cinematic aesthetics and generating multi-shot videos that are both camera-accurate and cross-shot consistent.