Cut2Next: Generating Next Shot via In-Context Tuning

Abstract

Effective multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods, however, often prioritize basic visual consistency and neglect the editing patterns (e.g., shot/reverse shot, cutaways) that drive narrative flow in compelling storytelling. The resulting outputs may be visually coherent but lack narrative sophistication and true cinematic integrity. To bridge this gap, we introduce Next Shot Generation (NSG): synthesizing a subsequent, high-quality shot that conforms to professional editing patterns while upholding rigorous cinematic continuity. Our framework, Cut2Next, leverages a Diffusion Transformer (DiT). It employs in-context tuning guided by a novel Hierarchical Multi-Prompting strategy. This strategy uses Relational Prompts to define the overall context and inter-shot editing styles. Individual Prompts then specify per-shot content and cinematographic attributes. Together, these guide Cut2Next to generate cinematically appropriate next shots. Two architectural innovations, Context-Aware Condition Injection (CACI) and Hierarchical Attention Mask (HAM), further integrate these diverse signals without introducing new parameters. We construct the RawCuts (large-scale) and CuratedCuts (refined) datasets, both with hierarchical prompts, and introduce CutBench for evaluation. Experiments show that Cut2Next excels in visual consistency and text fidelity. Crucially, user studies reveal a strong preference for Cut2Next, particularly for its adherence to intended editing patterns and overall cinematic continuity, validating its ability to generate high-quality, narratively expressive, and cinematically coherent subsequent shots.
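To make the two prompt levels concrete, the following is a minimal sketch of how a hierarchical prompt for one shot pair might be represented. It is an illustration only and not the authors' implementation: the class names, the fields `shot_size` and `camera_angle`, and the bracketed serialization format are all assumptions introduced here, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RelationalPrompt:
    """Shared level: overall context plus the inter-shot editing pattern."""
    context: str
    editing_pattern: str  # e.g. "shot/reverse shot" or "cutaway"


@dataclass
class IndividualPrompt:
    """Per-shot level: content plus cinematographic attributes."""
    content: str
    shot_size: str     # hypothetical attribute, e.g. "close-up"
    camera_angle: str  # hypothetical attribute, e.g. "eye-level"


@dataclass
class ShotPairPrompt:
    """One relational prompt governing two individual shot prompts."""
    relational: RelationalPrompt
    previous_shot: IndividualPrompt
    next_shot: IndividualPrompt

    def as_text_segments(self) -> List[str]:
        # Assumed serialization: one text segment per prompt, relational first.
        def describe(tag: str, p: IndividualPrompt) -> str:
            return f"[{tag}] {p.content} ({p.shot_size}, {p.camera_angle})"

        return [
            f"[RELATION] {self.relational.context} "
            f"| pattern: {self.relational.editing_pattern}",
            describe("SHOT 1", self.previous_shot),
            describe("SHOT 2", self.next_shot),
        ]


example = ShotPairPrompt(
    relational=RelationalPrompt(
        context="Two people talk across a cafe table",
        editing_pattern="shot/reverse shot",
    ),
    previous_shot=IndividualPrompt("A woman asks a question", "close-up", "eye-level"),
    next_shot=IndividualPrompt("A man listens and replies", "close-up", "eye-level"),
)
segments = example.as_text_segments()
```

Keeping the relational prompt separate from the per-shot prompts mirrors the split described above: the editing pattern is stated once for the pair, while each shot carries its own content and attributes.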
