Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation
May 29, 2023
Authors: Jiawei Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, Zhou Zhao
cs.AI
Abstract
Large diffusion models have been successful in text-to-audio (T2A) synthesis
tasks, but they often suffer from common issues such as semantic misalignment
and poor temporal consistency due to limited natural language understanding and
data scarcity. Additionally, 2D spatial structures widely used in T2A works
lead to unsatisfactory audio quality when generating variable-length audio
samples since they do not adequately prioritize temporal information. To
address these challenges, we propose Make-an-Audio 2, a latent diffusion-based
T2A method that builds on the success of Make-an-Audio. Our approach includes
several techniques to improve semantic alignment and temporal consistency:
Firstly, we use pre-trained large language models (LLMs) to parse the text into
structured <event & order> pairs for better temporal information capture. We
also introduce another structured-text encoder to aid in learning semantic
alignment during the diffusion denoising process. To improve the performance of
variable-length generation and enhance temporal information extraction, we
design a feed-forward Transformer-based diffusion denoiser. Finally, we use
LLMs to augment and transform a large amount of audio-label data into
audio-text datasets to alleviate the scarcity of temporal data.
Extensive experiments show that our method outperforms baseline models in both
objective and subjective metrics, and achieves significant gains in temporal
information understanding, semantic consistency, and sound quality.
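
The abstract describes parsing free-form captions into structured <event & order> pairs with a pre-trained LLM. The exact prompt and output format are not given here, so the following is only a minimal sketch of that idea; `query_llm` is a hypothetical placeholder for any chat-completion call, and the example caption/output are illustrative.

```python
# Hypothetical sketch: prompting an LLM to turn a free-form audio caption into
# structured <event & order> pairs. `query_llm` is a placeholder (assumption)
# for whatever LLM API is used; the prompt wording is not from the paper.
from typing import Callable, List, Tuple

PARSE_PROMPT = (
    "Decompose the audio caption into sound events and their temporal order.\n"
    "Answer with one '<event> & <order>' pair per line.\n"
    "Caption: {caption}"
)

def parse_caption(caption: str, query_llm: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Return (event, order) pairs extracted from the LLM's reply."""
    reply = query_llm(PARSE_PROMPT.format(caption=caption))
    pairs = []
    for line in reply.splitlines():
        if "&" in line:
            event, order = line.split("&", 1)
            pairs.append((event.strip(), order.strip()))
    return pairs

# Illustrative shape of the intended output (not actual model output):
# parse_caption("A dog barks, then a car passes by", my_llm)
# -> [("dog barking", "start"), ("car passing by", "end")]
```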
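The abstract also contrasts 2D spectrogram-style latents with a feed-forward Transformer denoiser that treats the latent as a 1D sequence of frames, which handles variable-length audio more naturally. Below is a minimal PyTorch sketch of that kind of 1D Transformer denoiser; the layer sizes, timestep embedding, and prepend-the-text-tokens conditioning scheme are assumptions for illustration, not the paper's exact architecture.

```python
# Minimal sketch (PyTorch) of a feed-forward Transformer diffusion denoiser that
# operates on a sequence of latent frames rather than a 2D image-like latent.
# Dimensions and the conditioning scheme below are illustrative assumptions.
import torch
import torch.nn as nn

class FFTDenoiser(nn.Module):
    def __init__(self, latent_dim=64, model_dim=384, text_dim=768, n_layers=6, n_heads=6):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, model_dim)
        self.time_emb = nn.Sequential(
            nn.Linear(1, model_dim), nn.SiLU(), nn.Linear(model_dim, model_dim)
        )
        self.text_proj = nn.Linear(text_dim, model_dim)
        layer = nn.TransformerEncoderLayer(
            model_dim, n_heads, dim_feedforward=4 * model_dim, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(model_dim, latent_dim)

    def forward(self, z_t, t, text_emb):
        # z_t: (B, T, latent_dim) noisy latent frames; T may vary across batches.
        # t:   (B,) diffusion timesteps; text_emb: (B, L, text_dim) text encoder states.
        h = self.in_proj(z_t) + self.time_emb(t.float().unsqueeze(-1)).unsqueeze(1)
        cond = self.text_proj(text_emb)                # simple conditioning: prepend text tokens
        h = self.backbone(torch.cat([cond, h], dim=1))
        return self.out_proj(h[:, cond.size(1):])      # predict noise for the latent frames only

# Usage sketch: a 250-frame latent and a 20-token text condition.
# noise_pred = FFTDenoiser()(torch.randn(2, 250, 64),
#                            torch.randint(0, 1000, (2,)),
#                            torch.randn(2, 20, 768))
```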