

Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion

March 16, 2026
Authors: Zhenghong Zhou, Xiaohang Zhan, Zhiqin Chen, Soo Ye Kim, Nanxuan Zhao, Haitian Zheng, Qing Liu, He Zhang, Zhe Lin, Yuqian Zhou, Jiebo Luo
cs.AI

Abstract

Recent video diffusion models have made remarkable strides in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. For AI video creators, three forms of control are crucial: (i) scene composition, (ii) multi-view consistent subject customization, and (iii) camera-pose or object-motion adjustment. Existing methods typically handle these dimensions in isolation, with limited support for multi-view subject synthesis and identity preservation under arbitrary pose changes. This lack of a unified architecture makes it difficult to support versatile, jointly controllable video generation. We introduce Tri-Prompting, a unified framework and two-stage training paradigm that integrates scene composition, multi-view subject consistency, and motion control. Our approach leverages a dual-condition motion module driven by 3D tracking points for background scenes and downsampled RGB cues for foreground subjects. To balance controllability against visual realism, we further propose an inference-time ControlNet scale schedule. Tri-Prompting supports novel workflows, including 3D-aware subject insertion into any scene and manipulation of existing subjects in an image. Experimental results demonstrate that Tri-Prompting significantly outperforms specialized baselines such as Phantom and DaS in multi-view subject identity, 3D consistency, and motion accuracy.
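The abstract does not spell out the form of the inference-time ControlNet scale schedule, but the idea of trading control strength against realism across denoising steps can be sketched minimally. The function name, cosine shape, and endpoint values below are illustrative assumptions, not the paper's actual schedule:

```python
import math

def controlnet_scale_schedule(step: int, num_steps: int,
                              start_scale: float = 1.0,
                              end_scale: float = 0.2) -> float:
    """Cosine-decayed ControlNet conditioning scale over denoising steps.

    Early steps apply a high scale, letting the control signal fix scene
    layout and motion; the scale then decays so later steps let the base
    diffusion model refine texture and realism. All constants here are
    hypothetical choices for illustration.
    """
    t = step / max(num_steps - 1, 1)          # normalized progress in [0, 1]
    w = 0.5 * (1.0 + math.cos(math.pi * t))   # cosine ramp from 1 down to 0
    return end_scale + (start_scale - end_scale) * w

# Scales across a 10-step sampler: strong control early, weak control late
scales = [controlnet_scale_schedule(s, 10) for s in range(10)]
```

The resulting multiplier would typically be applied to the ControlNet residuals before they are added into the denoiser at each step.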