ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation

Abstract

Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant obstacle. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution. Guided by this insight, we present ShotVerse, a "Plan-then-Control" framework that decouples generation into two collaborative agents: a Planner, based on a Vision-Language Model (VLM), that leverages spatial priors to obtain cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. Central to our approach is the construction of a data foundation: we design an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system. This facilitates the curation of ShotVerse-Bench, a high-fidelity cinematic dataset with a three-track evaluation protocol that serves as the bedrock for our framework. Extensive experiments demonstrate that ShotVerse effectively bridges the gap between unreliable textual control and labor-intensive manual plotting, achieving superior cinematic aesthetics and generating multi-shot videos that are both camera-accurate and cross-shot consistent.
