DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation

Abstract

Recent advancements in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks, including reference-based audio-video generation (R2AV), video editing (RV2AV), and audio-driven video animation (RA2V), as isolated objectives. Furthermore, achieving precise, disentangled control over multiple character identities and voice timbres within a single framework remains an open challenge. In this paper, we propose DreamID-Omni, a unified framework for controllable human-centric audio-video generation. Specifically, we design a Symmetric Conditional Diffusion Transformer that integrates heterogeneous conditioning signals via a symmetric conditional injection scheme. To resolve the pervasive identity-timbre binding failures and speaker confusion in multi-person scenarios, we introduce a Dual-Level Disentanglement strategy: Synchronized RoPE at the signal level to ensure rigid attention-space binding, and Structured Captions at the semantic level to establish explicit attribute-subject mappings. Furthermore, we devise a Multi-Task Progressive Training scheme that leverages weakly-constrained generative priors to regularize strongly-constrained tasks, preventing overfitting and harmonizing disparate objectives. Extensive experiments demonstrate that DreamID-Omni achieves comprehensive state-of-the-art performance across video, audio, and audio-visual consistency, even outperforming leading proprietary commercial models. We will release our code to bridge the gap between academic research and commercial-grade applications.

