DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation
February 12, 2026
Authors: Xu Guo, Fulong Ye, Qichao Sun, Liyang Chen, Bingchuan Li, Pengze Zhang, Jiawei Liu, Songtao Zhao, Qian He, Xiangwang Hou
cs.AI
Abstract
Recent advances in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks, including reference-based audio-video generation (R2AV), video-editing-based generation (RV2AV), and audio-driven video animation (RA2V), as isolated objectives. Furthermore, achieving precise, disentangled control over multiple character identities and voice timbres within a single framework remains an open challenge. In this paper, we propose DreamID-Omni, a unified framework for controllable human-centric audio-video generation. Specifically, we design a Symmetric Conditional Diffusion Transformer that integrates heterogeneous conditioning signals via a symmetric conditional injection scheme. To resolve the pervasive identity-timbre binding failures and speaker confusion in multi-person scenarios, we introduce a Dual-Level Disentanglement strategy: Synchronized RoPE at the signal level to enforce rigid attention-space binding, and Structured Captions at the semantic level to establish explicit attribute-subject mappings. In addition, we devise a Multi-Task Progressive Training scheme that leverages weakly constrained generative priors to regularize strongly constrained tasks, preventing overfitting and harmonizing disparate objectives. Extensive experiments demonstrate that DreamID-Omni achieves state-of-the-art performance across video quality, audio fidelity, and audio-visual consistency, even outperforming leading proprietary commercial models. We will release our code to bridge the gap between academic research and commercial-grade applications.
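The abstract does not specify how Synchronized RoPE is implemented, but the idea of "rigid attention-space binding" can be illustrated with a minimal sketch, assuming a standard rotary-position-embedding formulation: each speaker's reference tokens reuse the same rotary position indices as the video-latent region they control, so the query-key rotation cancels within an identity pairing while cross-identity pairs are rotated apart. All names here (`rope_freqs`, `apply_rope`, `ref_a_pos`, the 64-token layout) are hypothetical illustrations, not from the paper.

```python
import torch

def rope_freqs(dim: int, theta: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies for an even head dimension.
    return 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))

def apply_rope(x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    # Rotate query/key vectors x (tokens, dim) by their position indices pos (tokens,).
    angles = torch.outer(pos.float(), rope_freqs(x.shape[-1]))  # (tokens, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return out.flatten(-2)

# Toy layout (hypothetical): a 64-token video latent split into two spatial
# regions, one per speaker. Each speaker's reference tokens REUSE the position
# indices of the region they control ("synchronized" RoPE), so the relative
# rotation between a reference token and its own region stays aligned, while
# cross-identity pairs incur a position-dependent rotation in the logits.
dim = 32
video_pos = torch.arange(64)           # positions of video latent tokens
ref_a_pos = video_pos[:32]             # speaker A's reference shares region A's indices
ref_b_pos = video_pos[32:]             # speaker B's reference shares region B's indices

q = apply_rope(torch.randn(64, dim), video_pos)    # queries from the video latent
k_a = apply_rope(torch.randn(32, dim), ref_a_pos)  # keys from reference A
k_b = apply_rope(torch.randn(32, dim), ref_b_pos)  # keys from reference B
logits = q @ torch.cat([k_a, k_b]).T               # (64, 64) attention logits
```

Under this reading, the attention logit between a reference token and the region it is bound to depends only on content similarity, whereas mismatched identity-region pairs are systematically de-emphasized by the positional rotation; this is one plausible mechanism behind the paper's signal-level binding, complementing the semantic-level Structured Captions.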