JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation

Abstract

AI-generated content (AIGC) has rapidly expanded from text-to-image generation to high-quality multimodal synthesis across video and audio. In this context, joint audio-video generation (JAVG) has emerged as a fundamental task: producing synchronized, semantically aligned sound and vision from textual descriptions. Compared with advanced commercial models such as Veo3, however, existing open-source methods remain limited in generation quality, temporal synchrony, and alignment with human preferences. To bridge this gap, this paper presents JavisDiT++, a concise yet powerful framework for unified modeling and optimization of JAVG. First, we introduce a modality-specific mixture-of-experts (MS-MoE) design that enables effective cross-modal interaction while enhancing single-modal generation quality. Second, we propose a temporal-aligned rotary position embedding (TA-RoPE) strategy that achieves explicit, frame-level synchronization between audio and video tokens. Third, we develop an audio-video direct preference optimization (AV-DPO) method that aligns model outputs with human preferences along the quality, consistency, and synchrony dimensions. Built on Wan2.1-1.3B-T2V, our model achieves state-of-the-art performance with only around 1M public training entries, significantly outperforming prior approaches in both qualitative and quantitative evaluations. Comprehensive ablation studies validate the effectiveness of the proposed modules. All code, models, and datasets are released at https://JavisVerse.github.io/JavisDiT2-page.
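
For orientation, the preference-alignment idea behind AV-DPO can be related to the generic direct preference optimization (DPO) objective. The sketch below is a minimal Python version of that generic loss for a single preference pair. It is not the AV-DPO formulation, which the abstract does not specify. The function name `dpo_loss`, its arguments, and the default `beta=0.1` are assumptions chosen purely for illustration.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic DPO loss for one preference pair (illustrative, not AV-DPO).

    All inputs are scalar log-probabilities: the trainable policy's and the
    frozen reference model's scores for the preferred ("win") and the
    rejected ("lose") sample. `beta` is a hypothetical default.
    """
    # Implicit reward margin: how much more the policy favors the preferred
    # sample over the rejected one, relative to the reference model.
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    # Loss is -log(sigmoid(beta * margin)) = softplus(-beta * margin),
    # computed in a numerically stable form.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy and reference log-probabilities coincide, the margin is zero and the loss equals log 2. The loss falls below that value once the policy favors the preferred sample more strongly than the reference does.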
