JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
March 30, 2025
Authors: Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Hao Fei, Tat-Seng Chua
cs.AI
Abstract
This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion
Transformer designed for synchronized audio-video generation (JAVG). Built upon
the powerful Diffusion Transformer (DiT) architecture, JavisDiT is able to
generate high-quality audio and video content simultaneously from open-ended
user prompts. To ensure optimal synchronization, we introduce a fine-grained
spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal
Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and
fine-grained spatio-temporal priors, guiding the synchronization between the
visual and auditory components. Furthermore, we propose a new benchmark,
JavisBench, consisting of 10,140 high-quality text-captioned sounding videos
spanning diverse scenes and complex real-world scenarios. We also devise a
robust metric for evaluating the synchronization between generated audio-video
pairs in complex real-world content. Experimental results
demonstrate that JavisDiT significantly outperforms existing methods by
ensuring both high-quality generation and precise synchronization, setting a
new standard for JAVG tasks. Our code, model, and dataset will be made publicly
available at https://javisdit.github.io/.