OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

Abstract

Cloning camera motion from reference videos is an important task in video generation, because a reference video provides intuitive and precise control. Existing methods either rely directly on parametric camera representations, which cannot handle multi-shot generation, or synthesize cross-paired data, an approach that suffers from data scarcity and therefore performs poorly when cloning complex camera motion. To address these issues, we introduce a general camera motion representation that encodes cameras as grid motion videos. This camera grid represents the camera parameters visually and supports the integration of diverse trajectories for multi-shot video generation. Building on this representation, we propose OmniDirector, a unified framework, trained on million-scale camera grid-video pairs, that coordinates characters, actions, and cameras to provide director-level control for multimodal diffusion transformers. Furthermore, we design a novel hierarchical prompt expansion agent that understands the relationships among the control signals and systematically describes camera motion and visual content, thereby integrating the different signals harmoniously. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework. Project page: https://ymlinfeng.github.io/OmniDirector.github.io/