Retrieval-Augmented Text-to-Audio Generation

Abstract

Despite recent progress in text-to-audio (TTA) generation, we show that state-of-the-art models such as AudioLDM, when trained on datasets with an imbalanced class distribution such as AudioCaps, are biased in their generation performance. Specifically, they excel at generating common audio classes but underperform on rare ones, which degrades overall generation performance. We refer to this problem as long-tailed text-to-audio generation. To address it, we propose a simple retrieval-augmented approach for TTA models. Given an input text prompt, we first use a Contrastive Language-Audio Pretraining (CLAP) model to retrieve relevant text-audio pairs. The features of the retrieved audio-text data are then used as additional conditions to guide the learning of TTA models. We enhance AudioLDM with the proposed approach and call the resulting augmented system Re-AudioLDM. On the AudioCaps dataset, Re-AudioLDM achieves a state-of-the-art Fréchet Audio Distance (FAD) of 1.37, outperforming existing approaches by a large margin. Furthermore, we show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types, indicating its potential in TTA tasks.
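
The retrieval step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how similarity is scored or what is stored, so the following are assumptions made for illustration only: the datastore holds one text embedding per captioned clip, the prompt embedding is compared with the stored embeddings by cosine similarity, and the top-k neighbours are returned. The toy three-dimensional vectors stand in for real CLAP embeddings, and the names `cosine_similarity`, `retrieve_top_k`, and `datastore` are hypothetical rather than taken from Re-AudioLDM.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical datastore entry: (text embedding, caption, audio clip id).
Entry = Tuple[Sequence[float], str, str]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_emb: Sequence[float],
                   datastore: List[Entry],
                   k: int = 3) -> List[Tuple[str, str]]:
    """Return the k (caption, audio_id) pairs most similar to the query embedding."""
    scored = [
        (cosine_similarity(query_emb, emb), caption, audio_id)
        for emb, caption, audio_id in datastore
    ]
    # Highest similarity first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(caption, audio_id) for _, caption, audio_id in scored[:k]]


# Toy embeddings standing in for CLAP outputs (assumption: real ones are much larger).
datastore: List[Entry] = [
    ([1.0, 0.0, 0.0], "a dog barks", "clip_001"),
    ([0.0, 1.0, 0.0], "rain falls on a roof", "clip_002"),
    ([0.9, 0.1, 0.0], "a puppy yelps", "clip_003"),
]
query = [1.0, 0.05, 0.0]  # stand-in embedding of the input text prompt
neighbours = retrieve_top_k(query, datastore, k=2)
print(neighbours)
```

In a retrieval-augmented system of this kind, the features of the returned pairs would then be passed to the generator as additional conditioning alongside the prompt. How those features are encoded and injected is not specified in the abstract and is left out of this sketch.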
