AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation

June 11, 2024
Authors: Kai Wang, Shijian Deng, Jing Shi, Dimitrios Hatzinakos, Yapeng Tian
cs.AI

Abstract

Recent Diffusion Transformers (DiTs) have shown impressive capabilities in generating high-quality single-modality content, including images, videos, and audio. However, it remains under-explored whether transformer-based diffusers can efficiently denoise Gaussian noise to produce superb multimodal content. To bridge this gap, we introduce AV-DiT, a novel and efficient audio-visual diffusion transformer designed to generate high-quality, realistic videos with both visual and audio tracks. To minimize model complexity and computational costs, AV-DiT utilizes a shared DiT backbone pre-trained on image-only data, with only lightweight, newly inserted adapters being trainable. This shared backbone facilitates both audio and video generation. Specifically, the video branch incorporates a trainable temporal attention layer into a frozen pre-trained DiT block for temporal consistency. Additionally, a small number of trainable parameters adapt the image-based DiT block for audio generation. An extra shared DiT block, equipped with lightweight parameters, facilitates feature interaction between audio and visual modalities, ensuring alignment. Extensive experiments on the AIST++ and Landscape datasets demonstrate that AV-DiT achieves state-of-the-art performance in joint audio-visual generation with significantly fewer tunable parameters. Furthermore, our results highlight that a single shared image generative backbone with modality-specific adaptations is sufficient for constructing a joint audio-video generator. Our source code and pre-trained models will be released.
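
To make the adapter idea described in the abstract concrete, below is a minimal PyTorch sketch: a frozen, image-pretrained DiT block shared by both modalities, a trainable temporal attention layer for the video branch, a small trainable audio adapter, and a lightweight cross-modal attention for audio-visual alignment. All class names, shapes, adapter placements, and the one-directional fusion are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the AV-DiT adapter idea (hypothetical names and placements).
import torch
import torch.nn as nn


class FrozenDiTBlock(nn.Module):
    """Stand-in for one pre-trained image DiT block; its weights are kept frozen."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        for p in self.parameters():
            p.requires_grad = False  # shared backbone stays frozen

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))


class TemporalAttentionAdapter(nn.Module):
    """Trainable attention over the frame axis, giving the video branch temporal consistency."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, num_frames):
        # x: (batch * frames, tokens, dim) -> attend across frames at each spatial token
        bf, n, d = x.shape
        b = bf // num_frames
        h = x.reshape(b, num_frames, n, d).permute(0, 2, 1, 3).reshape(b * n, num_frames, d)
        h = self.norm(h)
        h = self.attn(h, h, h)[0]
        h = h.reshape(b, n, num_frames, d).permute(0, 2, 1, 3).reshape(bf, n, d)
        return x + h


class AVDiTBlockSketch(nn.Module):
    """Frozen shared DiT block plus lightweight trainable adapters for video, audio, and fusion."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.backbone = FrozenDiTBlock(dim, heads)             # shared, image-pretrained, frozen
        self.temporal = TemporalAttentionAdapter(dim, heads)   # trainable, video branch only
        self.audio_adapter = nn.Linear(dim, dim)                # small trainable audio adaptation
        self.cross_fuse = nn.MultiheadAttention(dim, heads, batch_first=True)  # lightweight A/V interaction

    def forward(self, video_tokens, audio_tokens):
        # video_tokens: (batch, frames, tokens per frame, dim); audio_tokens: (batch, tokens, dim)
        b, f, n, d = video_tokens.shape
        v = self.backbone(video_tokens.reshape(b * f, n, d))   # frozen spatial block per frame
        v = self.temporal(v, num_frames=f)                     # trainable temporal attention
        v = v.reshape(b, f * n, d)

        a = self.backbone(audio_tokens)      # the same frozen backbone is reused for audio
        a = a + self.audio_adapter(a)        # lightweight audio-specific adapter

        # Cross-modal interaction (one direction shown): video queries attend to audio tokens.
        v = v + self.cross_fuse(v, a, a)[0]
        return v, a


if __name__ == "__main__":
    block = AVDiTBlockSketch(dim=64)
    video = torch.randn(2, 4, 16, 64)   # (batch, frames, tokens per frame, dim)
    audio = torch.randn(2, 32, 64)      # (batch, audio tokens, dim)
    v_out, a_out = block(video, audio)
    print(v_out.shape, a_out.shape)     # torch.Size([2, 64, 64]) torch.Size([2, 32, 64])
```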
