

AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation

May 22, 2023
Authors: Guy Yariv, Itai Gat, Lior Wolf, Yossi Adi, Idan Schwartz
cs.AI

Abstract

In recent years, image generation has shown a great leap in performance, with diffusion models playing a central role. Although such models generate high-quality images, they are mainly conditioned on textual descriptions. This raises the question: "how can we adapt such models to be conditioned on other modalities?" In this paper, we propose a novel method that utilizes latent diffusion models trained for text-to-image generation to generate images conditioned on audio recordings. Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered an adaptation layer between the audio and text representations. Such a modeling paradigm requires a small number of trainable parameters, making the proposed approach appealing for lightweight optimization. Results suggest that the proposed method is superior to the evaluated baseline methods on both objective and subjective metrics. Code and samples are available at: https://pages.cs.huji.ac.il/adiyoss-lab/AudioToken.
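
The core idea, mapping a frozen audio encoder's features to a single pseudo text-token embedding through a small trainable projection, can be sketched as follows. This is a minimal PyTorch sketch: the class name AudioTokenAdapter, the mean-pooling over time, and the embedding dimensions are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class AudioTokenAdapter(nn.Module):
    """Hypothetical sketch of the AudioToken idea: project features from a
    frozen, pre-trained audio encoder into the text-token embedding space
    of a frozen text-to-image latent diffusion model. Only this small
    projection is trained, which matches the paper's claim of a small
    number of trainable parameters."""

    def __init__(self, audio_dim: int = 768, text_dim: int = 768):
        super().__init__()
        # The only trainable component: a small "adaptation layer".
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # audio_features: (batch, time, audio_dim) from the frozen encoder.
        # Pool over time (an assumed strategy) and map to one pseudo token.
        pooled = audio_features.mean(dim=1)       # (batch, audio_dim)
        return self.proj(pooled).unsqueeze(1)     # (batch, 1, text_dim)

# Usage sketch: splice the resulting audio token into the prompt's token
# embeddings, e.g. in place of a placeholder in "a photo of <audio>",
# before they enter the diffusion model's cross-attention conditioning.
```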