JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

March 30, 2025
Authors: Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Hao Fei, Tat-Seng Chua
cs.AI

Abstract

This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Built upon the Diffusion Transformer (DiT) architecture, JavisDiT generates high-quality audio and video content simultaneously from open-ended user prompts. To ensure synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors that guide the synchronization between the visual and auditory components. We also propose a new benchmark, JavisBench, consisting of 10,140 high-quality text-captioned sounding videos spanning diverse scenes and complex real-world scenarios, and we devise a robust metric for evaluating the synchronization between generated audio-video pairs in complex real-world content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods in both generation quality and synchronization precision, setting a new standard for JAVG tasks. Our code, model, and dataset will be made publicly available at https://javisdit.github.io/.
