Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation
May 29, 2023
作者: Jiawei Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, Zhou Zhao
cs.AI
Abstract
Large diffusion models have been successful in text-to-audio (T2A) synthesis
tasks, but they often suffer from common issues such as semantic misalignment
and poor temporal consistency due to limited natural language understanding and
data scarcity. Additionally, 2D spatial structures widely used in T2A works
lead to unsatisfactory audio quality when generating variable-length audio
samples since they do not adequately prioritize temporal information. To
address these challenges, we propose Make-an-Audio 2, a latent diffusion-based
T2A method that builds on the success of Make-an-Audio. Our approach includes
several techniques to improve semantic alignment and temporal consistency:
Firstly, we use pre-trained large language models (LLMs) to parse the text into
structured <event & order> pairs for better temporal information capture. We
also introduce another structured-text encoder to aid in learning semantic
alignment during the diffusion denoising process. To improve the performance of
variable-length generation and enhance temporal information extraction, we
design a feed-forward Transformer-based diffusion denoiser. Finally, we use
LLMs to augment and transform a large amount of audio-label data into
audio-text datasets to alleviate the scarcity of temporal data.
Extensive experiments show that our method outperforms baseline models in both
objective and subjective metrics, and achieves significant gains in temporal
information understanding, semantic consistency, and sound quality.
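
To make the LLM parsing step described in the abstract concrete, here is a minimal, hypothetical sketch of turning a free-form caption into structured <event & order> pairs with a caller-supplied pre-trained LLM. The prompt wording, the '@' separator, and the order tags ('start', 'mid', 'end', 'all') are illustrative assumptions, not the paper's actual prompt or output format.

```python
# Illustrative sketch (assumed prompt and output format): parse an audio
# caption into (event, order) pairs using any text-in/text-out LLM callable.
from typing import Callable, List, Tuple

PARSE_PROMPT = (
    "Decompose the audio caption into its sound events and their temporal order.\n"
    "Answer with one '<event>@<order>' pair per line, where <order> is a tag\n"
    "such as 'start', 'mid', 'end', or 'all'.\n"
    "Caption: {caption}"
)


def parse_caption(caption: str, llm: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Query the LLM with the parsing prompt and collect (event, order) pairs."""
    reply = llm(PARSE_PROMPT.format(caption=caption))
    pairs: List[Tuple[str, str]] = []
    for line in reply.strip().splitlines():
        if "@" in line:
            event, order = line.split("@", 1)
            pairs.append((event.strip(), order.strip()))
    return pairs


# Example (hypothetical): parse_caption("A dog barks and then a car drives by", my_llm)
# might return [("dog barking", "start"), ("car driving by", "end")], which the
# structured-text encoder could then consume as temporal conditioning.
```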