Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

Abstract

In recent years, diffusion-based generative models have attracted considerable attention in both visual and audio generation, owing to their realistic results and wide range of personalized applications. Compared with the substantial advances in text-to-image and text-to-audio generation, research on audio-to-visual and visual-to-audio generation has progressed relatively slowly. Recent audio-visual generation methods usually rely on very large language models or composable diffusion models. Instead of designing another giant model for audio-visual generation, in this paper we take a step back and show that a simple, lightweight generative transformer, an architecture not yet fully investigated in multi-modal generation, can achieve excellent results on image-to-audio generation. The transformer operates in the discrete audio and visual Vector-Quantized GAN (VQGAN) space and is trained in a mask-denoising manner. After training, classifier-free guidance can be applied off the shelf to improve performance, without any extra training or modification. Since the transformer is symmetric across modalities, it can also be deployed directly for audio-to-image generation and co-generation. In our experiments, we show that this simple method surpasses recent image-to-audio generation methods. Generated audio samples can be found at https://docs.google.com/presentation/d/1ZtC0SeblKkut4XJcRaDsSTuCRIXB3ypxmSi7HTY3IyQ
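
The two mechanisms named in the abstract, mask-denoising training over discrete tokens and classifier-free guidance at inference, can be illustrated with a minimal sketch. This is not the paper's implementation: the `MASK_ID` value, the helper names `mask_tokens` and `guided_logits`, and the guidance formula (extrapolating from unconditional toward conditional logits by a scale factor) are assumptions chosen for illustration, following a commonly used formulation.

```python
import random

MASK_ID = -1  # hypothetical id reserved for the mask token (assumption)


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random fraction of a 1-D token sequence with MASK_ID.

    Returns the corrupted sequence and the sorted masked positions; a
    mask-denoising model would be trained to predict the original tokens
    at those positions.
    """
    num_masked = min(len(tokens), max(1, round(mask_ratio * len(tokens))))
    masked_positions = sorted(rng.sample(range(len(tokens)), num_masked))
    corrupted = list(tokens)
    for pos in masked_positions:
        corrupted[pos] = MASK_ID
    return corrupted, masked_positions


def guided_logits(cond_logits, uncond_logits, scale):
    """Combine conditional and unconditional logits (assumed guidance form).

    scale = 1.0 returns the conditional logits unchanged; larger values
    push the prediction further toward the conditioning signal.
    """
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


if __name__ == "__main__":
    rng = random.Random(0)
    audio_tokens = list(range(10))  # stand-in for VQGAN code indices
    corrupted, positions = mask_tokens(audio_tokens, 0.5, rng)
    print(corrupted, positions)
    print(guided_logits([1.0, 2.0], [0.0, 1.0], 2.0))
```

In a sketch like this, the unconditional logits would come from the same transformer run with the conditioning modality fully masked, which is consistent with guidance needing no extra training.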
