DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning
March 12, 2026
Authors: Yujie Wei, Xinyu Liu, Shiwei Zhang, Hangjie Yuan, Jinbo Xing, Zhekai Chen, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Ruihang Chu, Yingya Zhang, Yike Guo, Xihui Liu, Hongming Shan
cs.AI
Abstract
While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and identity degradation, leading to suboptimal performance on identity preservation and motion control. In this work, we present DreamVideo-Omni, a unified framework enabling harmonious multi-subject customization with omni-motion control via a progressive two-stage training paradigm. In the first stage, we integrate comprehensive control signals for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. To ensure robust and precise controllability, we introduce a condition-aware 3D rotary positional embedding to coordinate heterogeneous inputs and a hierarchical motion injection strategy to enhance global motion guidance. Furthermore, to resolve multi-subject ambiguity, we introduce group and role embeddings to explicitly anchor motion signals to specific identities, effectively disentangling complex scenes into independent controllable instances. In the second stage, to mitigate identity degradation, we design a latent identity reward feedback learning paradigm by training a latent identity reward model upon a pretrained video diffusion backbone. This provides motion-aware identity rewards in the latent space, prioritizing identity preservation aligned with human preferences. Supported by our curated large-scale dataset and the comprehensive DreamOmni Bench for multi-subject and omni-motion control evaluation, DreamVideo-Omni demonstrates superior performance in generating high-quality videos with precise controllability.
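
To make the 3D rotary positional embedding mentioned above more concrete, the sketch below applies standard rotary embeddings independently along the frame, height, and width axes of flattened video tokens. This is a minimal PyTorch sketch of plain 3D RoPE only; how the paper's condition-aware variant coordinates the heterogeneous condition streams is not specified in the abstract, and all function names here are hypothetical.

```python
# Minimal sketch of a 3D rotary positional embedding for video tokens (PyTorch).
# The head dimension is split into three parts for the (frame, height, width)
# axes; assumes an even head dimension. Names are illustrative, not the paper's.
import torch

def rope_freqs(positions: torch.Tensor, dim: int, base: float = 10000.0):
    """Rotation angles for one axis: (num_positions, dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(positions.float(), inv_freq)

def apply_rotary(x: torch.Tensor, angles: torch.Tensor):
    """Rotate interleaved channel pairs of x (..., dim) by angles (..., dim // 2)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(tokens: torch.Tensor, t: int, h: int, w: int):
    """tokens: (t*h*w, dim) video tokens flattened in (t, h, w) raster order."""
    dim = tokens.shape[-1]
    assert dim % 2 == 0
    d_t = d_h = dim // 3 // 2 * 2           # even-sized slice per axis
    d_w = dim - d_t - d_h
    # Per-token (frame, row, col) coordinates matching the raster order.
    grid = torch.stack(torch.meshgrid(
        torch.arange(t), torch.arange(h), torch.arange(w), indexing="ij"
    ), dim=-1).reshape(-1, 3)
    out = tokens.clone()
    out[..., :d_t] = apply_rotary(tokens[..., :d_t], rope_freqs(grid[:, 0], d_t))
    out[..., d_t:d_t + d_h] = apply_rotary(tokens[..., d_t:d_t + d_h],
                                           rope_freqs(grid[:, 1], d_h))
    out[..., d_t + d_h:] = apply_rotary(tokens[..., d_t + d_h:],
                                        rope_freqs(grid[:, 2], d_w))
    return out

# Usage: 4 frames of 8x8 latent tokens with a 128-dim head.
rotated = rope_3d(torch.randn(4 * 8 * 8, 128), t=4, h=8, w=8)
```

Similarly, a minimal sketch of the latent identity reward feedback idea, assuming a ReFL-style setup: a frozen reward model scores identity similarity between a generated video latent and a reference subject latent, and the negative reward is backpropagated to the generator through the predicted clean latent. The paper's actual reward-model architecture, its motion-aware component, and its training procedure are not reproduced here; every module and shape below is illustrative.

```python
# Hypothetical latent identity reward: pooled features of the generated video
# latent are compared against the reference subject latent via cosine
# similarity, and the reward gradient flows back to the generator.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentIdentityReward(nn.Module):
    def __init__(self, latent_dim: int = 16, feat_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(latent_dim, feat_dim, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.AdaptiveAvgPool3d(1),   # pool over (frames, height, width)
            nn.Flatten(),
        )

    def forward(self, video_latent, ref_latent):
        f_gen = self.encoder(video_latent)  # (B, feat_dim)
        f_ref = self.encoder(ref_latent)    # (B, feat_dim)
        return F.cosine_similarity(f_gen, f_ref, dim=-1)  # reward in [-1, 1]

# Reward feedback step: maximize the reward on the predicted clean latent.
reward_model = LatentIdentityReward().eval().requires_grad_(False)
x0_pred = torch.randn(2, 16, 8, 32, 32, requires_grad=True)  # predicted video latent
ref = torch.randn(2, 16, 1, 32, 32)                          # reference subject latent
loss = -reward_model(x0_pred, ref).mean()
loss.backward()  # gradients reach the generator through x0_pred
```

Scoring the reward directly in the latent space, as the abstract describes, avoids decoding videos to pixels during feedback learning, which is what makes this kind of reward signal cheap enough to apply at training time.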