ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation

Abstract

Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant hurdle. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution. Guided by this insight, we present ShotVerse, a "Plan-then-Control" framework that decouples generation into two collaborative agents: a VLM (Vision-Language Model)-based Planner that leverages spatial priors to obtain cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. Central to our approach is the construction of a data foundation: we design an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system. This facilitates the curation of ShotVerse-Bench, a high-fidelity cinematic dataset with a three-track evaluation protocol that serves as the bedrock for our framework. Extensive experiments demonstrate that ShotVerse effectively bridges the gap between unreliable textual control and labor-intensive manual plotting, achieving superior cinematic aesthetics and generating multi-shot videos that are both camera-accurate and cross-shot consistent.
