Retrieval-Augmented Text-to-Audio Generation
September 14, 2023
Authors: Yi Yuan, Haohe Liu, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
cs.AI
Abstract
Despite recent progress in text-to-audio (TTA) generation, we show that the
state-of-the-art models, such as AudioLDM, trained on datasets with an
imbalanced class distribution, such as AudioCaps, are biased in their
generation performance. Specifically, they excel in generating common audio
classes while underperforming in the rare ones, thus degrading the overall
generation performance. We refer to this problem as long-tailed text-to-audio
generation. To address this issue, we propose a simple retrieval-augmented
approach for TTA models. Specifically, given an input text prompt, we first
leverage a Contrastive Language Audio Pretraining (CLAP) model to retrieve
relevant text-audio pairs. The features of the retrieved audio-text data are
then used as additional conditions to guide the learning of TTA models. We
enhance AudioLDM with our proposed approach and denote the resulting augmented
system as Re-AudioLDM. On the AudioCaps dataset, Re-AudioLDM achieves a
state-of-the-art Fréchet Audio Distance (FAD) of 1.37, outperforming the
existing approaches by a large margin. Furthermore, we show that Re-AudioLDM
can generate realistic audio for complex scenes, rare audio classes, and even
unseen audio types, indicating its potential in TTA tasks.
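The retrieval step described above — embedding the input prompt and fetching the most similar text-audio pairs from a datastore — can be sketched with a simple nearest-neighbour search over CLAP-style embeddings. This is a minimal illustration, not the paper's implementation: the embedding vectors here are toy 2-D stand-ins, and in practice they would come from a pretrained CLAP text/audio encoder.

```python
import numpy as np

def retrieve_top_k(query_emb, datastore_embs, k=3):
    """Return indices of the k datastore entries most similar to the query.

    Similarity is cosine similarity, the usual way CLAP embeddings are
    compared: both sides are L2-normalised, then a dot product is taken.
    """
    q = query_emb / np.linalg.norm(query_emb)
    d = datastore_embs / np.linalg.norm(datastore_embs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity to each entry
    return np.argsort(sims)[::-1][:k]  # indices, most similar first

# Toy datastore: 2-D stand-ins for CLAP text embeddings of cached captions.
datastore = np.array([
    [1.0, 0.0],   # e.g. "a dog barking"
    [0.0, 1.0],   # e.g. "rain falling on a roof"
    [0.9, 0.1],   # e.g. "a dog growling"
])
query = np.array([1.0, 0.05])  # embedding of the input text prompt

top = retrieve_top_k(query, datastore, k=2)
print(top)  # the two most similar datastore entries
```

In the full system, the features of the retrieved audio-text pairs (not just their indices) are fed to the diffusion model as additional conditioning, which is what helps with rare and unseen classes.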