DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning

Abstract

While large-scale diffusion models have revolutionized video synthesis, precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and identity degradation, which leaves both identity preservation and motion control suboptimal. In this work, we present DreamVideo-Omni, a unified framework that enables harmonious multi-subject customization with omni-motion control via a progressive two-stage training paradigm. In the first stage, we integrate comprehensive control signals for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. To ensure robust and precise controllability, we introduce a condition-aware 3D rotary positional embedding to coordinate heterogeneous inputs and a hierarchical motion injection strategy to enhance global motion guidance. Furthermore, to resolve multi-subject ambiguity, we introduce group and role embeddings that explicitly anchor motion signals to specific identities, effectively disentangling complex scenes into independently controllable instances. In the second stage, to mitigate identity degradation, we design a latent identity reward feedback learning paradigm by training a latent identity reward model on top of a pretrained video diffusion backbone. This model provides motion-aware identity rewards in the latent space, prioritizing identity preservation aligned with human preferences. Supported by our curated large-scale dataset and the comprehensive DreamOmni Bench for evaluating multi-subject and omni-motion control, DreamVideo-Omni demonstrates superior performance in generating high-quality videos with precise controllability.
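As a rough illustration of the group and role embedding idea mentioned above, the following sketch adds two learned vectors to every condition token: one identifying which subject the token belongs to (group) and one identifying what kind of signal it carries (role). The abstract does not specify how the embeddings are combined or sized, so the additive combination, the table shapes, the role names, and the function name `add_group_role_embeddings` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def add_group_role_embeddings(tokens, group_ids, role_ids, group_table, role_table):
    """Tag each token with a per-subject (group) and per-signal-type (role) embedding.

    tokens:      (n_tokens, dim) condition tokens
    group_ids:   (n_tokens,) integer subject index for each token
    role_ids:    (n_tokens,) integer signal-type index for each token
    group_table: (n_groups, dim) embedding table (hypothetical; learned in practice)
    role_table:  (n_roles, dim) embedding table (hypothetical; learned in practice)
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    # Assumption: embeddings are combined by simple addition.
    return tokens + group_table[group_ids] + role_table[role_ids]

rng = np.random.default_rng(0)
dim = 8
group_table = rng.normal(size=(3, dim))  # hypothetical: up to 3 subjects
role_table = rng.normal(size=(2, dim))   # hypothetical roles: 0 = appearance, 1 = motion

# Two subjects, each contributing one appearance token and one motion token.
tokens = np.zeros((4, dim))
group_ids = np.array([0, 0, 1, 1])
role_ids = np.array([0, 1, 0, 1])
tagged = add_group_role_embeddings(tokens, group_ids, role_ids, group_table, role_table)
```

Under these assumptions, a motion token and an appearance token of the same subject share a group vector, which is one simple way a model could learn to associate a motion signal with the identity it is meant to drive.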