JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

Abstract

This paper introduces JavisDiT, a Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Built on the Diffusion Transformer (DiT) architecture, JavisDiT generates high-quality audio and video content simultaneously from open-ended user prompts. To keep the two modalities synchronized, we introduce a fine-grained spatio-temporal alignment mechanism based on a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors, which guide the synchronization between the visual and auditory components. We also propose a new benchmark, JavisBench, consisting of 10,140 high-quality text-captioned sounding videos spanning diverse scenes and complex real-world scenarios, and we devise a robust metric for evaluating the synchronization between generated audio-video pairs in complex real-world content. Experimental results show that JavisDiT significantly outperforms existing methods in both generation quality and synchronization precision, setting a new standard for JAVG tasks. Our code, model, and dataset will be made publicly available at https://javisdit.github.io/.
