
Retrieval-Augmented Text-to-Audio Generation

September 14, 2023
Authors: Yi Yuan, Haohe Liu, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
cs.AI

Abstract

Despite recent progress in text-to-audio (TTA) generation, we show that the state-of-the-art models, such as AudioLDM, trained on datasets with an imbalanced class distribution, such as AudioCaps, are biased in their generation performance. Specifically, they excel in generating common audio classes while underperforming in the rare ones, thus degrading the overall generation performance. We refer to this problem as long-tailed text-to-audio generation. To address this issue, we propose a simple retrieval-augmented approach for TTA models. Specifically, given an input text prompt, we first leverage a Contrastive Language Audio Pretraining (CLAP) model to retrieve relevant text-audio pairs. The features of the retrieved audio-text data are then used as additional conditions to guide the learning of TTA models. We enhance AudioLDM with our proposed approach and denote the resulting augmented system as Re-AudioLDM. On the AudioCaps dataset, Re-AudioLDM achieves a state-of-the-art Frechet Audio Distance (FAD) of 1.37, outperforming the existing approaches by a large margin. Furthermore, we show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types, indicating its potential in TTA tasks.
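The retrieval step the abstract describes — embed the input prompt with CLAP, then fetch the most similar text-audio pairs from a datastore — can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the datastore here is a toy bank of precomputed embeddings, and in practice the query embedding would come from a CLAP text encoder rather than being constructed by hand.

```python
import numpy as np

def retrieve_top_k(query_emb, bank_embs, k=3):
    """Return indices of the k datastore entries most similar to the query,
    ranked by cosine similarity (as CLAP-style retrieval typically is)."""
    q = query_emb / np.linalg.norm(query_emb)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    sims = b @ q                      # cosine similarity against every entry
    return np.argsort(-sims)[:k]      # indices of the k highest scores

# Toy datastore: 5 caption embeddings in a 4-dim space (CLAP uses 512 dims)
rng = np.random.default_rng(0)
bank = rng.normal(size=(5, 4))
# A query close to entry 2, standing in for a CLAP text embedding
query = bank[2] + 0.01 * rng.normal(size=4)
print(retrieve_top_k(query, bank, k=2))  # entry 2 ranks first
```

In Re-AudioLDM the embeddings of the retrieved audio-text pairs are then fed to the TTA model as extra conditioning alongside the prompt; the retrieval itself is this kind of nearest-neighbour lookup in CLAP embedding space.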