Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation
May 29, 2023
Authors: Jiawei Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, Zhou Zhao
cs.AI
Abstract
Large diffusion models have been successful in text-to-audio (T2A) synthesis
tasks, but they often suffer from common issues such as semantic misalignment
and poor temporal consistency due to limited natural language understanding and
data scarcity. Additionally, 2D spatial structures widely used in T2A works
lead to unsatisfactory audio quality when generating variable-length audio
samples since they do not adequately prioritize temporal information. To
address these challenges, we propose Make-an-Audio 2, a latent diffusion-based
T2A method that builds on the success of Make-an-Audio. Our approach includes
several techniques to improve semantic alignment and temporal consistency:
Firstly, we use pre-trained large language models (LLMs) to parse the text into
structured <event & order> pairs for better temporal information capture. We
also introduce another structured-text encoder to aid in learning semantic
alignment during the diffusion denoising process. To improve the performance of
variable-length generation and enhance temporal information extraction, we
design a feed-forward Transformer-based diffusion denoiser. Finally, we use
LLMs to augment and transform a large amount of audio-label data into
audio-text datasets to alleviate the scarcity of temporal data.
Extensive experiments show that our method outperforms baseline models in both
objective and subjective metrics, and achieves significant gains in temporal
information understanding, semantic consistency, and sound quality.
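
The abstract describes parsing free-form captions into structured <event & order> pairs with a pre-trained LLM. The exact prompt and output format are not given here, so the following is only a minimal sketch of that idea; `query_llm` is a hypothetical placeholder for any chat-completion call, and the example caption/output are illustrative.

```python
# Hypothetical sketch: prompting an LLM to turn a free-form audio caption into
# structured <event & order> pairs. `query_llm` is a placeholder (assumption)
# for whatever LLM API is used; the prompt wording is not from the paper.
from typing import Callable, List, Tuple

PARSE_PROMPT = (
    "Decompose the audio caption into sound events and their temporal order.\n"
    "Answer with one '<event> & <order>' pair per line.\n"
    "Caption: {caption}"
)

def parse_caption(caption: str, query_llm: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Return (event, order) pairs extracted from the LLM's reply."""
    reply = query_llm(PARSE_PROMPT.format(caption=caption))
    pairs = []
    for line in reply.splitlines():
        if "&" in line:
            event, order = line.split("&", 1)
            pairs.append((event.strip(), order.strip()))
    return pairs

# Illustrative shape of the intended output (not actual model output):
# parse_caption("A dog barks, then a car passes by", my_llm)
# -> [("dog barking", "start"), ("car passing by", "end")]
```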
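The abstract also contrasts 2D spectrogram-style latents with a feed-forward Transformer denoiser that treats the latent as a 1D sequence of frames, which handles variable-length audio more naturally. Below is a minimal PyTorch sketch of that kind of 1D Transformer denoiser; the layer sizes, timestep embedding, and prepend-the-text-tokens conditioning scheme are assumptions for illustration, not the paper's exact architecture.

```python
# Minimal sketch (PyTorch) of a feed-forward Transformer diffusion denoiser that
# operates on a sequence of latent frames rather than a 2D image-like latent.
# Dimensions and the conditioning scheme below are illustrative assumptions.
import torch
import torch.nn as nn

class FFTDenoiser(nn.Module):
    def __init__(self, latent_dim=64, model_dim=384, text_dim=768, n_layers=6, n_heads=6):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, model_dim)
        self.time_emb = nn.Sequential(
            nn.Linear(1, model_dim), nn.SiLU(), nn.Linear(model_dim, model_dim)
        )
        self.text_proj = nn.Linear(text_dim, model_dim)
        layer = nn.TransformerEncoderLayer(
            model_dim, n_heads, dim_feedforward=4 * model_dim, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(model_dim, latent_dim)

    def forward(self, z_t, t, text_emb):
        # z_t: (B, T, latent_dim) noisy latent frames; T may vary across batches.
        # t:   (B,) diffusion timesteps; text_emb: (B, L, text_dim) text encoder states.
        h = self.in_proj(z_t) + self.time_emb(t.float().unsqueeze(-1)).unsqueeze(1)
        cond = self.text_proj(text_emb)                # simple conditioning: prepend text tokens
        h = self.backbone(torch.cat([cond, h], dim=1))
        return self.out_proj(h[:, cond.size(1):])      # predict noise for the latent frames only

# Usage sketch: a 250-frame latent and a 20-token text condition.
# noise_pred = FFTDenoiser()(torch.randn(2, 250, 64),
#                            torch.randint(0, 1000, (2,)),
#                            torch.randn(2, 20, 768))
```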