ChatPaper.ai


AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation

May 22, 2023
Authors: Guy Yariv, Itai Gat, Lior Wolf, Yossi Adi, Idan Schwartz
cs.AI

Abstract

In recent years, image generation has shown a great leap in performance, with diffusion models playing a central role. Although such models generate high-quality images, they are mainly conditioned on textual descriptions. This raises the question: "how can we adapt such models to be conditioned on other modalities?" In this paper, we propose a novel method utilizing latent diffusion models trained for text-to-image generation to generate images conditioned on audio recordings. Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered as an adaptation layer between the audio and text representations. Such a modeling paradigm requires a small number of trainable parameters, making the proposed approach appealing for lightweight optimization. Results suggest the proposed method is superior to the evaluated baseline methods, considering objective and subjective metrics. Code and samples are available at: https://pages.cs.huji.ac.il/adiyoss-lab/AudioToken.
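The core idea described in the abstract, mapping a frozen audio encoder's embedding to a single pseudo-token in the text encoder's embedding space, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the dimensions, the `AudioTokenAdapter` class, and the `splice_audio_token` helper are all hypothetical, and a simple linear projection stands in for whatever adaptation network the authors use. Only the adapter's parameters would be trained; the audio encoder and the diffusion model stay frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not taken from the paper): the audio encoder's
# pooled embedding size, the text encoder's token-embedding size, and the
# prompt sequence length.
AUDIO_DIM, TEXT_DIM, SEQ_LEN = 512, 768, 77


class AudioTokenAdapter:
    """Hypothetical adaptation layer: projects a pooled audio embedding
    into the text encoder's token-embedding space, yielding one "audio
    token". W and b are the only trainable parameters in this sketch."""

    def __init__(self, audio_dim: int, text_dim: int, rng) -> None:
        self.W = rng.standard_normal((audio_dim, text_dim)) * 0.02
        self.b = np.zeros(text_dim)

    def __call__(self, audio_embedding: np.ndarray) -> np.ndarray:
        # (audio_dim,) @ (audio_dim, text_dim) -> (text_dim,)
        return audio_embedding @ self.W + self.b


def splice_audio_token(prompt_embeddings: np.ndarray,
                       audio_token: np.ndarray,
                       position: int) -> np.ndarray:
    """Replace one placeholder token embedding in the encoded prompt
    (e.g. the slot for "<audio>" in "a photo of <audio>") with the
    audio-derived token, producing the conditioning sequence."""
    out = prompt_embeddings.copy()
    out[position] = audio_token
    return out


# Usage with stand-in values: random arrays replace the frozen audio
# encoder's output and the text encoder's prompt embeddings.
audio_emb = rng.standard_normal(AUDio_DIM) if False else rng.standard_normal(AUDIO_DIM)
prompt_emb = rng.standard_normal((SEQ_LEN, TEXT_DIM))

adapter = AudioTokenAdapter(AUDIO_DIM, TEXT_DIM, rng)
token = adapter(audio_emb)                       # shape: (TEXT_DIM,)
conditioned = splice_audio_token(prompt_emb, token, position=4)
```

The lightweight-optimization claim corresponds to the parameter count here: the adapter holds `AUDIO_DIM * TEXT_DIM + TEXT_DIM` weights, orders of magnitude fewer than the diffusion model it conditions.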