ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation

Abstract

Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant obstacle. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution. Guided by this insight, we present ShotVerse, a "Plan-then-Control" framework that decouples generation into two collaborative agents: a Planner based on a Vision-Language Model (VLM) that leverages spatial priors to obtain cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. Central to our approach is the construction of a data foundation: we design an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system. This pipeline enables the curation of ShotVerse-Bench, a high-fidelity cinematic dataset with a three-track evaluation protocol that serves as the bedrock for our framework. Extensive experiments demonstrate that ShotVerse effectively bridges the gap between unreliable textual control and labor-intensive manual plotting, achieving superior cinematic aesthetics and generating multi-shot videos that are both camera-accurate and cross-shot consistent.
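To make the idea of aligning disjoint single-shot trajectories into a unified global coordinate system more concrete, the following is a minimal geometric sketch in Python. It assumes each shot's trajectory is a list of 4x4 camera-to-world poses expressed in that shot's own local frame, and that a rigid transform from each shot's frame to the global frame is already known. The abstract does not say how the calibration pipeline estimates these transforms, so the names `pose`, `align_shots`, and `camera_position` and the example transforms are hypothetical and purely illustrative.

```python
import math

Matrix = list[list[float]]  # 4x4 homogeneous transform, row-major


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def pose(yaw_deg: float, tx: float, ty: float, tz: float) -> Matrix:
    """Build a 4x4 rigid transform: rotation about the y axis plus a translation."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    return [
        [c, 0.0, s, tx],
        [0.0, 1.0, 0.0, ty],
        [-s, 0.0, c, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def align_shots(shots: list[list[Matrix]],
                shot_to_global: list[Matrix]) -> list[list[Matrix]]:
    """Re-express every local pose in the global frame.

    Assumption: shot_to_global[i] maps shot i's local frame to the global frame.
    """
    return [[matmul(T, p) for p in trajectory]
            for trajectory, T in zip(shots, shot_to_global)]


def camera_position(p: Matrix) -> tuple[float, float, float]:
    """Read the camera centre from a camera-to-world pose."""
    return (p[0][3], p[1][3], p[2][3])


# Two shots, each starting at the origin of its own local frame.
shot_a = [pose(0, 0, 0, 0), pose(0, 0, 0, 1)]   # moves 1 unit along local z
shot_b = [pose(0, 0, 0, 0), pose(0, 1, 0, 0)]   # moves 1 unit along local x

# Hypothetical alignment: shot A defines the global frame,
# shot B is rotated 90 degrees about y and offset 5 units along global x.
transforms = [pose(0, 0, 0, 0), pose(90, 5, 0, 0)]

global_shots = align_shots(shots, transforms) if False else align_shots(
    [shot_a, shot_b], transforms)
for name, trajectory in zip(("A", "B"), global_shots):
    print(name, [camera_position(p) for p in trajectory])
```

After alignment, both trajectories live in the same frame, so quantities such as the distance between the last camera of one shot and the first camera of the next become meaningful. Estimating the per-shot transforms is the hard part and is not covered by this sketch.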

