OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

Abstract

Cloning camera motion from reference videos is an important task in video generation, because a reference video offers intuitive and precise control. Existing methods take one of two routes. Some directly use parametric camera representations, which cannot handle multi-shot generation. Others synthesize cross-paired data, which is scarce, and therefore perform poorly when cloning complicated camera motion. To address these issues, we introduce a general camera motion representation that encodes cameras as grid motion videos. This camera grid represents the camera parameters visually and supports the integration of diverse trajectories for multi-shot video generation. Building on this representation, we propose OmniDirector, a unified framework trained on million-scale camera grid-video pairs that coordinates characters, actions, and cameras to provide director-level control for multimodal diffusion transformers. Furthermore, we design a novel hierarchical prompt expansion agent that understands the relationships among the control signals and systematically describes camera motion and visual content, so that the different signals are integrated harmoniously. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework. Project page: https://ymlinfeng.github.io/OmniDirector.github.io/