AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation

Abstract

Recent Diffusion Transformers (DiTs) have shown impressive capabilities in generating high-quality single-modality content, including images, videos, and audio. However, whether a transformer-based diffusion model can efficiently denoise Gaussian noise to produce high-quality multimodal content remains under-explored. To bridge this gap, we introduce AV-DiT, a novel and efficient audio-visual diffusion transformer designed to generate high-quality, realistic videos with both visual and audio tracks. To minimize model complexity and computational cost, AV-DiT uses a shared DiT backbone pre-trained on image-only data and trains only lightweight, newly inserted adapters. This shared backbone serves both audio and video generation. Specifically, the video branch inserts a trainable temporal attention layer into a frozen pre-trained DiT block for temporal consistency. Additionally, a small number of trainable parameters adapt the image-based DiT block for audio generation. An extra shared DiT block, equipped with lightweight parameters, facilitates feature interaction between the audio and visual modalities, ensuring alignment. Extensive experiments on the AIST++ and Landscape datasets demonstrate that AV-DiT achieves state-of-the-art performance in joint audio-visual generation with significantly fewer tunable parameters. Furthermore, our results highlight that a single shared image generative backbone with modality-specific adaptations is sufficient for constructing a joint audio-video generator. Our source code and pre-trained models will be released.
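The parameter-efficiency argument can be illustrated with a small bookkeeping sketch. It is not the authors' implementation: it only tallies which parameter groups would be frozen and which would be trainable under the layout the abstract describes. The class and function names (`ParamGroup`, `AVDiTSketch`, `build_sketch`), the per-layer placement of the interaction parameters, and every parameter count are hypothetical placeholders chosen for illustration, not values from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ParamGroup:
    """A named group of parameters; `trainable` plays the role of requires_grad."""
    name: str
    num_params: int
    trainable: bool


@dataclass
class AVDiTSketch:
    """Bookkeeping-only sketch: a frozen shared backbone plus small trainable parts."""
    groups: list = field(default_factory=list)

    def add(self, name, num_params, trainable):
        self.groups.append(ParamGroup(name, num_params, trainable))

    def count(self, trainable):
        """Total number of parameters that are (or are not) trainable."""
        return sum(g.num_params for g in self.groups if g.trainable == trainable)

    def trainable_fraction(self):
        total = sum(g.num_params for g in self.groups)
        return self.count(True) / total


def build_sketch(depth=4, block_params=1_000_000, temporal_params=50_000,
                 audio_adapter_params=10_000, interaction_params=20_000):
    """Assemble the sketch. All sizes are made-up placeholders, not paper values."""
    model = AVDiTSketch()
    for i in range(depth):
        # Shared image-pretrained DiT block: frozen and reused by both branches.
        model.add(f"dit_block_{i}", block_params, trainable=False)
        # Video branch: trainable temporal attention inserted into the frozen block.
        model.add(f"video_temporal_attn_{i}", temporal_params, trainable=True)
        # Audio branch: a few trainable parameters adapting the same block to audio.
        model.add(f"audio_adapter_{i}", audio_adapter_params, trainable=True)
        # Audio-visual interaction: assumed here to reuse the shared block's frozen
        # weights, so only its lightweight trainable parameters are counted.
        model.add(f"av_interaction_{i}", interaction_params, trainable=True)
    return model


if __name__ == "__main__":
    model = build_sketch()
    print("frozen params:   ", model.count(trainable=False))
    print("trainable params:", model.count(trainable=True))
    print(f"trainable share:  {model.trainable_fraction():.1%}")
```

With these placeholder sizes, the trainable groups make up well under a tenth of the total, which is the kind of ratio the abstract's "significantly fewer tunable parameters" refers to. The actual ratio depends on the real backbone and adapter sizes, which the abstract does not state.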
