ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation

Abstract

Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant bottleneck. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution. Guided by this insight, we present ShotVerse, a "Plan-then-Control" framework that decouples generation into two collaborative agents: a Planner, based on a Vision-Language Model (VLM), that leverages spatial priors to obtain cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. Central to our approach is the construction of a data foundation: we design an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system. This facilitates the curation of ShotVerse-Bench, a high-fidelity cinematic dataset with a three-track evaluation protocol that serves as the bedrock for our framework. Extensive experiments demonstrate that ShotVerse effectively bridges the gap between unreliable textual control and labor-intensive manual plotting, achieving superior cinematic aesthetics and generating multi-shot videos that are both camera-accurate and cross-shot consistent.
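To make the phrase "unified global coordinate system" concrete, the following is a minimal sketch of what mapping disjoint single-shot trajectories into one frame could look like. It is not the calibration pipeline described above, whose details the abstract does not give. It assumes each shot's camera poses are 4x4 camera-to-world matrices in the shot's own frame and that a rigid shot-to-global transform is already known; the helper names (`rigid`, `rot_y`, `to_global`) and the example poses are hypothetical.

```python
import numpy as np


def rigid(rotation, translation):
    """Build a 4x4 rigid transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T


def rot_y(degrees):
    """Rotation matrix for a turn of `degrees` about the y axis."""
    a = np.radians(degrees)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def to_global(local_poses, shot_to_global):
    """Map one shot's camera-to-world poses into the shared global frame.

    local_poses: (N, 4, 4) poses expressed in the shot's own frame.
    shot_to_global: (4, 4) transform from the shot frame to the global frame
        (assumed to be given; estimating it is the hard part in practice).
    """
    # Left-multiply every pose by the same transform via broadcasting.
    return shot_to_global[None] @ local_poses


# Two hypothetical shots, each starting at its own local origin.
shot_a = np.stack([rigid(np.eye(3), [0.0, 0.0, z]) for z in (0.0, 1.0, 2.0)])
shot_b = np.stack([rigid(np.eye(3), [x, 0.0, 0.0]) for x in (0.0, 1.0)])

# Assumption: shot A defines the global frame; shot B is rotated 90 degrees
# about y and offset by (5, 0, 2) relative to it.
global_a = to_global(shot_a, np.eye(4))
global_b = to_global(shot_b, rigid(rot_y(90.0), [5.0, 0.0, 2.0]))

# All camera positions now live in one coordinate system.
positions = np.concatenate([global_a[:, :3, 3], global_b[:, :3, 3]])
print(np.round(positions, 3))
```

Once every shot is expressed in the same frame, the camera positions of different shots can be compared or concatenated directly, which is the property a globally aligned trajectory relies on.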
