Cut2Next: Generating Next Shot via In-Context Tuning
August 11, 2025
Authors: Jingwen He, Hongbo Liu, Jiajun Li, Ziqi Huang, Yu Qiao, Wanli Ouyang, Ziwei Liu
cs.AI
Abstract
Effective multi-shot generation demands purposeful, film-like transitions and
strict cinematic continuity. Current methods, however, often prioritize basic
visual consistency, neglecting crucial editing patterns (e.g., shot/reverse
shot, cutaways) that drive narrative flow for compelling storytelling. This
yields outputs that may be visually coherent but lack narrative sophistication
and true cinematic integrity. To bridge this gap, we introduce Next Shot Generation
(NSG): synthesizing a subsequent, high-quality shot that critically conforms to
professional editing patterns while upholding rigorous cinematic continuity.
Our framework, Cut2Next, leverages a Diffusion Transformer (DiT). It employs
in-context tuning guided by a novel Hierarchical Multi-Prompting strategy. This
strategy uses Relational Prompts to define overall context and inter-shot
editing styles. Individual Prompts then specify per-shot content and
cinematographic attributes. Together, these guide Cut2Next to generate
cinematically appropriate next shots. Architectural innovations, Context-Aware
Condition Injection (CACI) and Hierarchical Attention Mask (HAM), further
integrate these diverse signals without introducing new parameters. We
construct RawCuts (large-scale) and CuratedCuts (refined) datasets, both with
hierarchical prompts, and introduce CutBench for evaluation. Experiments show
Cut2Next excels in visual consistency and text fidelity. Crucially, user
studies reveal a strong preference for Cut2Next, particularly for its adherence
to intended editing patterns and overall cinematic continuity, validating its
ability to generate high-quality, narratively expressive, and cinematically
coherent subsequent shots.
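The abstract does not detail how the Hierarchical Attention Mask (HAM) is constructed. Purely as an illustration of the general idea, the sketch below builds a boolean attention mask over a token sequence laid out as [relational prompt | (individual prompt, shot latents) per shot], where relational-prompt tokens attend globally, each individual prompt is scoped to its own shot, and shot latents attend across all shots for visual consistency. The function name, layout, and masking rules are assumptions for this sketch, not the paper's exact scheme.

```python
import numpy as np

def build_hierarchical_mask(rel_len, shot_specs):
    """Illustrative hierarchical attention mask (True = may attend).

    Token layout: [relational prompt | (prompt_i, latents_i) for each shot].
    rel_len: number of relational-prompt tokens (global editing context).
    shot_specs: list of (prompt_len, latent_len) pairs, one per shot.

    Assumed rules (hypothetical, for illustration only):
      - Relational tokens attend to, and are attended by, all tokens.
      - An individual prompt sees itself, its own shot's latents, and
        the relational tokens, but not other shots' prompts.
      - Shot latents see all latents (cross-shot visual consistency),
        their own prompt, and the relational tokens.
    """
    total = rel_len + sum(p + l for p, l in shot_specs)
    mask = np.zeros((total, total), dtype=bool)

    # Relational tokens: full bidirectional access.
    mask[:rel_len, :] = True
    mask[:, :rel_len] = True

    # Record per-shot segment boundaries.
    segments, offset = [], rel_len
    for p, l in shot_specs:
        segments.append((offset, offset + p, offset + p, offset + p + l))
        offset += p + l

    # Columns occupied by any shot's latent tokens.
    latent_cols = np.zeros(total, dtype=bool)
    for _, _, ls, le in segments:
        latent_cols[ls:le] = True

    for ps, pe, ls, le in segments:
        mask[ps:pe, ps:pe] = True      # prompt attends to itself
        mask[ps:pe, ls:le] = True      # prompt -> its own shot's latents
        mask[ls:le, ps:pe] = True      # latents -> their own prompt
        mask[ls:le, latent_cols] = True  # latents -> all shots' latents
    return mask
```

With two relational tokens and two shots of (3 prompt, 4 latent) tokens each, prompt 1 can reach the relational tokens and its own latents but not prompt 2, while the two shots' latents remain mutually visible.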