AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation

Abstract

In recent years, image generation has shown a great leap in performance, with diffusion models playing a central role. Although they generate high-quality images, such models are mainly conditioned on textual descriptions. This raises the question of how such models can be adapted to condition on other modalities. In this paper, we propose a novel method that uses latent diffusion models trained for text-to-image generation to generate images conditioned on audio recordings. Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered an adaptation layer between the audio and text representations. Such a modeling paradigm requires a small number of trainable parameters, making the proposed approach appealing for lightweight optimization. Results suggest the proposed method is superior to the evaluated baseline methods on both objective and subjective metrics. Code and samples are available at: https://pages.cs.huji.ac.il/adiyoss-lab/AudioToken.
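The idea of encoding audio into a single new token that sits alongside text tokens can be illustrated with a small NumPy sketch. This is not the authors' implementation: the random "encoder", the mean pooling over time, the single linear projection, and all dimensions and names (frozen_audio_encoder, AudioTokenAdapter, insert_audio_token) are assumptions chosen only to show the data flow and why the trainable parameter count stays small.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not values from the paper.
D_AUDIO = 768   # feature size of the frozen audio encoder (assumed)
D_TEXT = 768    # token-embedding size of the text encoder (assumed)
N_FRAMES = 50   # number of audio frames produced for one clip (assumed)


def frozen_audio_encoder(waveform: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a pre-trained, frozen audio encoder.

    Returns random frame-level features of shape (N_FRAMES, D_AUDIO).
    """
    return rng.standard_normal((N_FRAMES, D_AUDIO))


class AudioTokenAdapter:
    """The only trainable part in this sketch: one linear projection."""

    def __init__(self) -> None:
        self.weight = rng.standard_normal((D_AUDIO, D_TEXT)) * 0.02
        self.bias = np.zeros(D_TEXT)

    def __call__(self, frames: np.ndarray) -> np.ndarray:
        pooled = frames.mean(axis=0)             # (D_AUDIO,), average over time
        return pooled @ self.weight + self.bias  # (D_TEXT,), the audio token

    def num_parameters(self) -> int:
        return self.weight.size + self.bias.size


def insert_audio_token(prompt_embeddings: np.ndarray,
                       audio_token: np.ndarray,
                       position: int) -> np.ndarray:
    """Overwrite the placeholder slot in the prompt with the audio token."""
    conditioned = prompt_embeddings.copy()
    conditioned[position] = audio_token
    return conditioned


waveform = rng.standard_normal(16000)        # fake audio samples
prompt = rng.standard_normal((8, D_TEXT))    # fake embeddings of an 8-token prompt
adapter = AudioTokenAdapter()
audio_token = adapter(frozen_audio_encoder(waveform))
conditioned = insert_audio_token(prompt, audio_token, position=5)
```

In a setup like this, the conditioned sequence would be passed to the frozen text-to-image model in place of the ordinary prompt embeddings, and only the adapter's weights would be updated during training.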

