Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

May 23, 2024
Authors: Shiqi Yang, Zhi Zhong, Mengjie Zhao, Shusuke Takahashi, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji
cs.AI

Abstract

In recent years, diffusion-based generative models have gained huge attention in both visual and audio generation, owing to their realistic results and wide range of personalized applications. Compared to the considerable advances in text2image and text2audio generation, research in audio2visual and visual2audio generation has been relatively slow. Recent audio-visual generation methods usually resort to huge large language models or composable diffusion models. Instead of designing another giant model for audio-visual generation, in this paper we take a step back and show that a simple and lightweight generative transformer, which has not been fully investigated in multi-modal generation, can achieve excellent results on image2audio generation. The transformer operates in the discrete audio and visual Vector-Quantized GAN space and is trained in the mask denoising manner. After training, classifier-free guidance can be deployed off-the-shelf to achieve better performance, without any extra training or modification. Since the transformer model is modality-symmetric, it can also be directly deployed for audio2image generation and co-generation. In the experiments, we show that our simple method surpasses recent image2audio generation methods. Generated audio samples can be found at https://docs.google.com/presentation/d/1ZtC0SeblKkut4XJcRaDsSTuCRIXB3ypxmSi7HTY3IyQ
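
The two mechanisms named in the abstract, mask denoising over discrete tokens and classifier-free guidance at inference, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mask_tokens` and `cfg_logits`, the mask id, and the guidance formula `uncond + scale * (cond - uncond)` are assumptions chosen for illustration (the formula is one common formulation of classifier-free guidance; the abstract does not state which variant the paper uses).

```python
import numpy as np


def mask_tokens(tokens, mask_ratio, rng, mask_id):
    """Replace a random subset of discrete tokens with mask_id.

    Hypothetical helper: during mask-denoising training, a model would be
    asked to predict the original ids at the masked positions.
    Returns the masked sequence and a boolean array marking masked positions.
    """
    n = tokens.shape[0]
    n_mask = max(1, int(round(mask_ratio * n)))
    idx = rng.choice(n, size=n_mask, replace=False)
    is_masked = np.zeros(n, dtype=bool)
    is_masked[idx] = True
    masked = np.where(is_masked, mask_id, tokens)
    return masked, is_masked


def cfg_logits(cond_logits, uncond_logits, scale):
    """Combine conditional and unconditional logits (assumed CFG variant).

    scale = 1 recovers the conditional logits; scale > 1 pushes the
    prediction further toward the conditioning signal.
    """
    return uncond_logits + scale * (cond_logits - uncond_logits)


# Toy usage: 16 audio token ids from a hypothetical codebook, mask id 1024.
rng = np.random.default_rng(0)
audio_tokens = np.arange(16)
masked, is_masked = mask_tokens(audio_tokens, 0.5, rng, mask_id=1024)

# Toy logits over a 2-entry vocabulary for a single masked position.
guided = cfg_logits(np.array([2.0, 0.0]), np.array([1.0, 1.0]), scale=3.0)
```

In this sketch the unconditional logits would come from running the same transformer with the conditioning modality fully masked, which is why guidance of this kind needs no extra training once the model has seen masked inputs.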
