AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation

Abstract

In recent years, image generation has made a great leap in performance, with diffusion models playing a central role. Although they generate high-quality images, such models are mainly conditioned on textual descriptions. This raises the question: "How can we adapt such models to be conditioned on other modalities?" In this paper, we propose a novel method that uses latent diffusion models trained for text-to-image generation to generate images conditioned on audio recordings. Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered an adaptation layer between the audio and text representations. This modeling paradigm requires only a small number of trainable parameters, making the proposed approach appealing for lightweight optimization. Results suggest that the proposed method is superior to the evaluated baseline methods on both objective and subjective metrics. Code and samples are available at: https://pages.cs.huji.ac.il/adiyoss-lab/AudioToken.
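The idea of encoding audio into a single new token can be illustrated with a small sketch. The abstract does not specify the audio encoder, the pooling step, or the form of the adaptation layer, so everything below is an assumption made for illustration: the names `AudioAdapter`, `mean_pool`, and `inject_audio_token`, the toy dimensions, and the choice of a single linear projection are all hypothetical, and plain Python lists stand in for tensors. It is not the authors' implementation.

```python
import random

D_AUDIO = 8  # hypothetical feature size of the frozen audio encoder
D_TEXT = 4   # hypothetical size of one text-token embedding


def mean_pool(frames):
    """Average per-frame audio features into one vector (assumed pooling)."""
    n = len(frames)
    return [sum(f[i] for f in frames) / n for i in range(len(frames[0]))]


class AudioAdapter:
    """Hypothetical linear adapter; the only trainable part in this sketch."""

    def __init__(self, d_audio, d_text, seed=0):
        rng = random.Random(seed)
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(d_audio)]
                       for _ in range(d_text)]
        self.bias = [0.0] * d_text

    def __call__(self, audio_vec):
        # Project the pooled audio vector into the text-embedding space.
        return [sum(w * x for w, x in zip(row, audio_vec)) + b
                for row, b in zip(self.weight, self.bias)]

    def num_parameters(self):
        return len(self.weight) * len(self.weight[0]) + len(self.bias)


def inject_audio_token(prompt_embeddings, placeholder_index, audio_token):
    """Return a copy of the prompt embeddings with one row replaced."""
    out = [list(e) for e in prompt_embeddings]
    out[placeholder_index] = list(audio_token)
    return out


rng = random.Random(1)
# Stand-in for the output of a frozen, pre-trained audio encoder (5 frames).
frames = [[rng.random() for _ in range(D_AUDIO)] for _ in range(5)]
# Stand-in for frozen text-embedding rows of a 3-token prompt.
prompt = [[0.0] * D_TEXT for _ in range(3)]

adapter = AudioAdapter(D_AUDIO, D_TEXT)
audio_token = adapter(mean_pool(frames))
conditioned = inject_audio_token(prompt, placeholder_index=2,
                                 audio_token=audio_token)
```

In this sketch only the adapter's weights would be updated during training, while the audio encoder and the text-to-image model stay frozen, which is one way to read the abstract's point about a small number of trainable parameters.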
