
AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation

June 11, 2024
Authors: Kai Wang, Shijian Deng, Jing Shi, Dimitrios Hatzinakos, Yapeng Tian
cs.AI

Abstract

Recent Diffusion Transformers (DiTs) have shown impressive capabilities in generating high-quality single-modality content, including images, videos, and audio. However, it remains under-explored whether transformer-based diffusers can efficiently denoise Gaussian noise for high-quality multimodal content creation. To bridge this gap, we introduce AV-DiT, a novel and efficient audio-visual diffusion transformer designed to generate high-quality, realistic videos with both visual and audio tracks. To minimize model complexity and computational costs, AV-DiT utilizes a shared DiT backbone pre-trained on image-only data, with only lightweight, newly inserted adapters being trainable. This shared backbone facilitates both audio and video generation. Specifically, the video branch incorporates a trainable temporal attention layer into a frozen pre-trained DiT block for temporal consistency. Additionally, a small number of trainable parameters adapt the image-based DiT block for audio generation. An extra shared DiT block, equipped with lightweight parameters, facilitates feature interaction between audio and visual modalities, ensuring alignment. Extensive experiments on the AIST++ and Landscape datasets demonstrate that AV-DiT achieves state-of-the-art performance in joint audio-visual generation with significantly fewer tunable parameters. Furthermore, our results highlight that a single shared image generative backbone with modality-specific adaptations is sufficient for constructing a joint audio-video generator. Our source code and pre-trained models will be released.
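To make the adapter idea in the abstract concrete, the sketch below shows one way the video branch could be assembled: a frozen, image-pretrained DiT block wrapped with a lightweight, trainable temporal attention layer. Since the authors' code is not yet released, all class names, dimensions, and the adapter placement here are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal, hypothetical sketch of the frozen-backbone + trainable-adapter pattern
# described in the AV-DiT abstract. Names and shapes are illustrative only.
import torch
import torch.nn as nn


class FrozenDiTBlock(nn.Module):
    """Stand-in for a pre-trained, image-only DiT block (kept frozen)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        for p in self.parameters():  # freeze the image-pretrained weights
            p.requires_grad = False

    def forward(self, x):  # x: (batch*frames, tokens, dim) spatial tokens
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class TemporalAttentionAdapter(nn.Module):
    """Trainable temporal attention inserted for cross-frame consistency."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, num_frames: int):  # x: (batch*frames, tokens, dim)
        bt, n, d = x.shape
        b = bt // num_frames
        # attend over the time axis independently for each spatial token
        t = x.view(b, num_frames, n, d).permute(0, 2, 1, 3).reshape(b * n, num_frames, d)
        h = self.norm(t)
        t = t + self.attn(h, h, h, need_weights=False)[0]
        return t.view(b, n, num_frames, d).permute(0, 2, 1, 3).reshape(bt, n, d)


class VideoBranchBlock(nn.Module):
    """Frozen shared DiT block followed by a lightweight trainable adapter."""

    def __init__(self, dim: int):
        super().__init__()
        self.spatial = FrozenDiTBlock(dim)              # shared, frozen backbone
        self.temporal = TemporalAttentionAdapter(dim)   # small trainable adapter

    def forward(self, x, num_frames: int):
        x = self.spatial(x)
        return self.temporal(x, num_frames)


if __name__ == "__main__":
    block = VideoBranchBlock(dim=256)
    video_tokens = torch.randn(2 * 4, 16, 256)  # (batch*frames, tokens, dim)
    out = block(video_tokens, num_frames=4)
    trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
    print(out.shape, f"trainable params: {trainable}")
```

Under this reading, only the adapter parameters are updated during training, which is consistent with the abstract's claim of significantly fewer tunable parameters; the audio branch and the shared audio-visual interaction block would follow the same frozen-plus-adapter pattern.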
