Cut2Next: Generating Next Shot via In-Context Tuning

Abstract

Effective multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods, however, often prioritize basic visual consistency and neglect the editing patterns (e.g., shot/reverse shot, cutaways) that drive narrative flow and make storytelling compelling. Their outputs may be visually coherent but lack narrative sophistication and true cinematic integrity. To bridge this gap, we introduce Next Shot Generation (NSG): synthesizing a subsequent, high-quality shot that conforms to professional editing patterns while upholding rigorous cinematic continuity. Our framework, Cut2Next, builds on a Diffusion Transformer (DiT) and employs in-context tuning guided by a novel Hierarchical Multi-Prompting strategy. In this strategy, Relational Prompts define the overall context and the inter-shot editing style, while Individual Prompts specify the content and cinematographic attributes of each shot. Together, these prompts guide Cut2Next to generate cinematically appropriate next shots. Two architectural innovations, Context-Aware Condition Injection (CACI) and Hierarchical Attention Mask (HAM), further integrate these diverse signals without introducing new parameters. We construct two datasets, RawCuts (large-scale) and CuratedCuts (refined), both annotated with hierarchical prompts, and introduce CutBench for evaluation. Experiments show that Cut2Next excels in visual consistency and text fidelity. Crucially, user studies reveal a strong preference for Cut2Next, particularly for its adherence to intended editing patterns and its overall cinematic continuity, validating its ability to generate high-quality, narratively expressive, and cinematically coherent subsequent shots.
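The abstract names a Hierarchical Attention Mask but does not state its masking rule, so the following is only an illustrative sketch of how a two-level prompt hierarchy could be turned into a block attention mask. Everything in it is an assumption made for illustration: the `Segment` type, the `may_attend` rule (the relational prompt is visible to all tokens, an individual prompt is visible only within its own shot, and image tokens of different shots see each other), and the token counts. It should not be read as the authors' actual HAM definition.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """A contiguous run of tokens in the joint sequence (hypothetical layout)."""
    name: str
    shot: Optional[int]  # None marks the relational (global) prompt
    is_image: bool       # True for shot image tokens, False for prompt tokens
    length: int          # number of tokens in this segment


def may_attend(q: Segment, k: Segment) -> bool:
    """Assumed rule: may a query token in q attend to a key token in k?"""
    if q.shot is None or k.shot is None:
        return True   # the relational prompt is shared by every token
    if q.shot == k.shot:
        return True   # a shot and its own individual prompt see each other
    return q.is_image and k.is_image  # image tokens see other shots' images


def build_mask(segments: List[Segment]) -> List[List[bool]]:
    """Expand the segment-level rule into a token-level boolean mask."""
    owners = [seg for seg in segments for _ in range(seg.length)]
    return [[may_attend(q, k) for k in owners] for q in owners]


# Toy sequence: one relational prompt, then per-shot prompts and image tokens.
segments = [
    Segment("relational_prompt", None, False, 2),
    Segment("individual_prompt_0", 0, False, 1),
    Segment("individual_prompt_1", 1, False, 1),
    Segment("shot_0_image", 0, True, 2),
    Segment("shot_1_image", 1, True, 2),
]
mask = build_mask(segments)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

Under this assumed rule, the prompt describing the first shot cannot leak into the tokens of the second shot, while the relational prompt and the image tokens remain shared, which is one plausible way to keep per-shot descriptions separate while preserving continuity between shots.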
