JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
February 22, 2026
Authors: Kai Liu, Yanhao Zheng, Kai Wang, Shengqiong Wu, Rongjunchen Zhang, Jiebo Luo, Dimitrios Hatzinakos, Ziwei Liu, Hao Fei, Tat-Seng Chua
cs.AI
Abstract
AIGC has rapidly expanded from text-to-image generation toward high-quality multimodal synthesis across video and audio. Within this context, joint audio-video generation (JAVG) has emerged as a fundamental task: producing synchronized and semantically aligned audio and video from textual descriptions. However, compared with advanced commercial models such as Veo3, existing open-source methods remain limited in generation quality, temporal synchrony, and alignment with human preferences. To bridge this gap, this paper presents JavisDiT++, a concise yet powerful framework for unified modeling and optimization of JAVG. First, we introduce a modality-specific mixture-of-experts (MS-MoE) design that enables efficient cross-modal interaction while enhancing single-modal generation quality. Second, we propose a temporal-aligned RoPE (TA-RoPE) strategy that achieves explicit, frame-level synchronization between audio and video tokens. Third, we develop an audio-video direct preference optimization (AV-DPO) method that aligns model outputs with human preferences along three dimensions: quality, consistency, and synchrony. Built upon Wan2.1-1.3B-T2V, our model achieves state-of-the-art performance with only around 1M public training entries, significantly outperforming prior approaches in both qualitative and quantitative evaluations. Comprehensive ablation studies validate the effectiveness of the proposed modules. The code, models, and datasets are released at https://JavisVerse.github.io/JavisDiT2-page.
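For readers unfamiliar with the preference-optimization idea behind AV-DPO, the following is a minimal sketch of the generic DPO loss for one preference pair. It is not the paper's AV-DPO objective: the abstract does not specify how the quality, consistency, and synchrony dimensions enter the loss, and diffusion-based variants typically substitute denoising errors for log-likelihoods. The function name `dpo_loss`, its parameters, and the default `beta` are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic DPO loss for a single (preferred, rejected) sample pair.

    Illustrative only: all names and the default beta are assumptions,
    not taken from the JavisDiT++ paper.
    """
    # Implicit reward margin: how much more strongly the trained policy
    # prefers the winning sample than the frozen reference model does.
    margin = (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    # -log(sigmoid(beta * margin)) equals softplus(-beta * margin);
    # the form below avoids overflow for large |x|.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Usage: the loss falls below log(2) once the policy favors the preferred
# sample more than the reference does, and rises above it otherwise.
loss_good = dpo_loss(-1.0, -3.0, -2.0, -2.0)
loss_bad = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

When the policy and the reference assign identical log-likelihoods, the margin is zero and the loss equals log 2, which serves as the neutral starting point of training.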