DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning

Abstract

While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and identity degradation, leading to suboptimal performance on identity preservation and motion control. In this work, we present DreamVideo-Omni, a unified framework enabling harmonious multi-subject customization with omni-motion control via a progressive two-stage training paradigm. In the first stage, we integrate comprehensive control signals for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. To ensure robust and precise controllability, we introduce a condition-aware 3D rotary positional embedding to coordinate heterogeneous inputs and a hierarchical motion injection strategy to enhance global motion guidance. Furthermore, to resolve multi-subject ambiguity, we introduce group and role embeddings to explicitly anchor motion signals to specific identities, effectively disentangling complex scenes into independent controllable instances. In the second stage, to mitigate identity degradation, we design a latent identity reward feedback learning paradigm by training a latent identity reward model upon a pretrained video diffusion backbone. This provides motion-aware identity rewards in the latent space, prioritizing identity preservation aligned with human preferences. Supported by our curated large-scale dataset and the comprehensive DreamOmni Bench for multi-subject and omni-motion control evaluation, DreamVideo-Omni demonstrates superior performance in generating high-quality videos with precise controllability.
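
The abstract does not say how the group and role embeddings are combined with the condition tokens. As a rough illustration only, the sketch below assumes the simplest option: each token receives the sum of a per-subject group vector and a per-role vector, so that appearance and motion tokens belonging to the same subject share a group offset. The table sizes, the role names, the additive combination, and the helper names `make_table` and `tag_token` are all hypothetical, and fixed random vectors stand in for learned parameters.

```python
import random

DIM = 4  # toy feature width (assumption, not from the paper)


def make_table(n_rows, dim, seed):
    """Stand-in for a learned embedding table: fixed pseudo-random vectors."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_rows)]


# Hypothetical role vocabulary: what kind of signal a token carries.
ROLES = {"appearance": 0, "motion": 1}

# One row per subject slot (group) and one row per role.
group_table = make_table(n_rows=3, dim=DIM, seed=0)
role_table = make_table(n_rows=len(ROLES), dim=DIM, seed=1)


def tag_token(feature, group_id, role):
    """Add the group and role embeddings to a single condition token."""
    g = group_table[group_id]
    r = role_table[ROLES[role]]
    return [f + gi + ri for f, gi, ri in zip(feature, g, r)]


# Zero features make the added offsets easy to inspect.
zero = [0.0] * DIM
subj0_appearance = tag_token(zero, group_id=0, role="appearance")
subj0_motion = tag_token(zero, group_id=0, role="motion")
subj1_motion = tag_token(zero, group_id=1, role="motion")

# Tokens with the same role but different subjects differ only by the
# group offset, which is what ties a motion signal to one identity.
group_gap = [a - b for a, b in zip(subj0_motion, subj1_motion)]
```

In this toy setup, `subj0_appearance` and `subj0_motion` share the group-0 offset and differ only in the role vector, while `group_gap` equals the difference between the two group rows. A real model would learn both tables jointly with the backbone and may combine them in a different way.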