DreamID-Omni: A Unified Framework for Controllable Human-Centric Audio-Video Generation

Abstract

Recent advances in foundation models have transformed joint audio-video generation. However, existing approaches typically treat human-centric tasks, including reference-based audio-video generation (R2AV), video editing (RV2AV), and audio-driven video animation (RA2V), as isolated objectives. Furthermore, achieving precise, disentangled control over multiple character identities and voice timbres within a single framework remains an open challenge. In this paper, we propose DreamID-Omni, a unified framework for controllable human-centric audio-video generation. Specifically, we design a Symmetric Conditional Diffusion Transformer that integrates heterogeneous conditioning signals via a symmetric conditional injection scheme. To resolve the identity-timbre binding failures and speaker confusion that are pervasive in multi-person scenarios, we introduce a Dual-Level Disentanglement strategy: Synchronized RoPE at the signal level ensures rigid attention-space binding, and Structured Captions at the semantic level establish explicit attribute-subject mappings. In addition, we devise a Multi-Task Progressive Training scheme that uses weakly constrained generative priors to regularize strongly constrained tasks, preventing overfitting and harmonizing disparate objectives. Extensive experiments demonstrate that DreamID-Omni achieves state-of-the-art performance in video quality, audio fidelity, and audio-visual consistency, even outperforming leading proprietary commercial models. We will release our code to bridge the gap between academic research and commercial-grade applications.
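The abstract does not specify how Synchronized RoPE assigns positions, so the following is only a minimal sketch of one plausible reading: the identity (face) and timbre (voice) reference tokens of the same subject are given the same rotary position index, so their relative offset in attention is zero. The helper `synchronized_positions` and its `offset` and `stride` parameters are hypothetical and are not taken from the paper; `rope_rotate` is the standard rotary embedding, whose attention scores depend only on relative position.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply a standard rotary position embedding to one even-length vector."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even feature dimension")
    out = []
    for i in range(0, dim, 2):
        # Each feature pair (i, i+1) rotates at its own frequency.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))


def synchronized_positions(num_subjects, offset=1000, stride=100):
    """Hypothetical scheme (an assumption, not the paper's): subject k's
    identity and timbre reference tokens both use position offset + k * stride."""
    return {k: offset + k * stride for k in range(num_subjects)}


# Toy query (identity token) and key (timbre token) features.
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
pos = synchronized_positions(2)

# Same subject: both tokens share one position, so the relative offset is zero.
same_subject = dot(rope_rotate(q, pos[0]), rope_rotate(k, pos[0]))
# Different subjects: positions differ, so the score is rotated by the offset.
cross_subject = dot(rope_rotate(q, pos[0]), rope_rotate(k, pos[1]))
```

Under this assumed scheme, the same-subject score equals the unrotated dot product of the two feature vectors, while a cross-subject score is modulated by the position gap. This is one way a shared index could tie a face to its voice in attention space.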