Cut2Next: Generating Next Shot via In-Context Tuning

Abstract

Effective multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods, however, often prioritize basic visual consistency and neglect the editing patterns (e.g., shot/reverse shot, cutaways) that drive narrative flow. Their outputs may be visually coherent but lack narrative sophistication and true cinematic integrity. To bridge this gap, we introduce Next Shot Generation (NSG): synthesizing a subsequent, high-quality shot that conforms to professional editing patterns while upholding rigorous cinematic continuity. Our framework, Cut2Next, leverages a Diffusion Transformer (DiT) and employs in-context tuning guided by a novel Hierarchical Multi-Prompting strategy. This strategy uses Relational Prompts to define the overall context and inter-shot editing styles, and Individual Prompts to specify per-shot content and cinematographic attributes. Together, these guide Cut2Next to generate cinematically appropriate next shots. Two architectural innovations, Context-Aware Condition Injection (CACI) and Hierarchical Attention Mask (HAM), integrate these diverse signals without introducing new parameters. We construct two datasets with hierarchical prompts, the large-scale RawCuts and the refined CuratedCuts, and introduce CutBench for evaluation. Experiments show that Cut2Next excels in visual consistency and text fidelity. Crucially, user studies reveal a strong preference for Cut2Next, particularly for its adherence to intended editing patterns and overall cinematic continuity, validating its ability to generate high-quality, narratively expressive, and cinematically coherent subsequent shots.
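
To make the two prompt levels concrete, the following is a minimal sketch of how a hierarchical prompt for a two-shot sequence might be organized. It is an illustration only: the class names, the fields `shot_size` and `camera_angle`, and the bracketed text layout are all hypothetical and are not taken from Cut2Next or its datasets.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IndividualPrompt:
    """Per-shot level: content plus cinematographic attributes (fields are hypothetical)."""
    content: str
    shot_size: str      # assumed attribute, e.g. "close-up"
    camera_angle: str   # assumed attribute, e.g. "eye level"


@dataclass
class RelationalPrompt:
    """Shared level: overall context plus the inter-shot editing pattern."""
    context: str
    editing_pattern: str  # e.g. "shot/reverse shot" or "cutaway"


@dataclass
class HierarchicalPrompt:
    relational: RelationalPrompt
    shots: List[IndividualPrompt]

    def flatten(self) -> List[str]:
        """Return one text line per prompt: the relational prompt first, then each shot."""
        r = self.relational
        lines = [f"[relation] {r.context} | pattern: {r.editing_pattern}"]
        for i, s in enumerate(self.shots, start=1):
            lines.append(
                f"[shot {i}] {s.content} | size: {s.shot_size} | angle: {s.camera_angle}"
            )
        return lines


# Illustrative values only; not drawn from RawCuts, CuratedCuts, or CutBench.
example = HierarchicalPrompt(
    relational=RelationalPrompt(
        context="Two detectives argue in a dim office",
        editing_pattern="shot/reverse shot",
    ),
    shots=[
        IndividualPrompt("The older detective speaks", "medium close-up", "eye level"),
        IndividualPrompt("The younger detective reacts", "close-up", "eye level"),
    ],
)

for line in example.flatten():
    print(line)
```

Keeping the relational prompt separate from the per-shot prompts mirrors the split described in the abstract: one description shared by the whole sequence, and one description for each shot.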
