

DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning

March 12, 2026
作者: Yujie Wei, Xinyu Liu, Shiwei Zhang, Hangjie Yuan, Jinbo Xing, Zhekai Chen, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Ruihang Chu, Yingya Zhang, Yike Guo, Xihui Liu, Hongming Shan
cs.AI

Abstract

While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and identity degradation, leading to suboptimal performance on identity preservation and motion control. In this work, we present DreamVideo-Omni, a unified framework enabling harmonious multi-subject customization with omni-motion control via a progressive two-stage training paradigm. In the first stage, we integrate comprehensive control signals for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. To ensure robust and precise controllability, we introduce a condition-aware 3D rotary positional embedding to coordinate heterogeneous inputs and a hierarchical motion injection strategy to enhance global motion guidance. Furthermore, to resolve multi-subject ambiguity, we introduce group and role embeddings to explicitly anchor motion signals to specific identities, effectively disentangling complex scenes into independent controllable instances. In the second stage, to mitigate identity degradation, we design a latent identity reward feedback learning paradigm by training a latent identity reward model upon a pretrained video diffusion backbone. This provides motion-aware identity rewards in the latent space, prioritizing identity preservation aligned with human preferences. Supported by our curated large-scale dataset and the comprehensive DreamOmni Bench for multi-subject and omni-motion control evaluation, DreamVideo-Omni demonstrates superior performance in generating high-quality videos with precise controllability.
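The abstract's group and role embeddings, which anchor motion signals to specific subject identities, can be illustrated with a minimal sketch. This is not the paper's implementation; the class name, embedding sizes, and role categories below are illustrative assumptions. The idea shown is simply that each conditioning token receives a learned "group" embedding (which subject it belongs to) and a learned "role" embedding (what kind of signal it carries), so that motion tokens for different subjects become distinguishable to the model:

```python
import torch
import torch.nn as nn

class SubjectAnchoredEmbedding(nn.Module):
    """Hypothetical sketch of group/role anchoring (names are illustrative).

    Each token gets two additive learned embeddings:
      - group: which subject identity the token is bound to
      - role:  what the token encodes (e.g. appearance, local motion, camera)
    """

    def __init__(self, dim: int, max_subjects: int = 4, num_roles: int = 3):
        super().__init__()
        self.group_emb = nn.Embedding(max_subjects, dim)
        self.role_emb = nn.Embedding(num_roles, dim)

    def forward(self, tokens: torch.Tensor,
                group_ids: torch.Tensor,
                role_ids: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim); group_ids / role_ids: (batch, seq) int labels
        return tokens + self.group_emb(group_ids) + self.role_emb(role_ids)

# Toy usage: two subjects in a scene, tokens tagged by subject and signal type.
dim = 64
layer = SubjectAnchoredEmbedding(dim, max_subjects=2, num_roles=3)
tokens = torch.randn(1, 8, dim)
group_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1]])  # first half subject 0, rest subject 1
role_ids = torch.tensor([[0, 1, 1, 2, 0, 1, 1, 2]])   # appearance / motion / camera
out = layer(tokens, group_ids, role_ids)
```

Additive identity-tagged embeddings like this let downstream attention layers separate otherwise ambiguous motion signals per subject without changing the token sequence length.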