JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

Abstract

This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Built on the Diffusion Transformer (DiT) architecture, JavisDiT generates high-quality audio and video content simultaneously from open-ended user prompts. To ensure synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors, which guide the synchronization between the visual and auditory components. We also propose a new benchmark, JavisBench, consisting of 10,140 high-quality text-captioned sounding videos spanning diverse scenes and complex real-world scenarios, and we devise a robust metric for evaluating the synchronization between generated audio-video pairs in complex real-world content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods by delivering both high-quality generation and precise synchronization, setting a new standard for JAVG tasks. Our code, model, and dataset will be made publicly available at https://javisdit.github.io/.