JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

Abstract

This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Built upon the powerful Diffusion Transformer (DiT) architecture, JavisDiT generates high-quality audio and video content simultaneously from open-ended user prompts. To ensure optimal synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors, which guide the synchronization between the visual and auditory components. We also propose a new benchmark, JavisBench, consisting of 10,140 high-quality text-captioned sounding videos spanning diverse scenes and complex real-world scenarios. In addition, we devise a robust metric for evaluating the synchronization of generated audio-video pairs in complex real-world content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods by achieving both high-quality generation and precise synchronization, setting a new standard for JAVG tasks. Our code, model, and dataset will be made publicly available at https://javisdit.github.io/.
