Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

Abstract

Large diffusion models have been successful in text-to-audio (T2A) synthesis, but they often suffer from semantic misalignment and poor temporal consistency because of limited natural language understanding and data scarcity. In addition, the 2D spatial structures widely used in T2A work do not adequately prioritize temporal information, which leads to unsatisfactory audio quality when generating variable-length audio samples. To address these challenges, we propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio. Our approach includes several techniques to improve semantic alignment and temporal consistency. First, we use pre-trained large language models (LLMs) to parse the text into structured <event & order> pairs for better temporal information capture. We also introduce a separate structured-text encoder to aid in learning semantic alignment during the diffusion denoising process. To improve variable-length generation and enhance temporal information extraction, we design a feed-forward Transformer-based diffusion denoiser. Finally, we use LLMs to augment and transform a large amount of audio-label data into audio-text datasets, alleviating the scarcity of temporal data. Extensive experiments show that our method outperforms baseline models on both objective and subjective metrics, and achieves significant gains in temporal information understanding, semantic consistency, and sound quality.
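To make the structured <event & order> representation more concrete, the following is a minimal sketch of how such pairs might be held and rendered as text. It is an illustration under stated assumptions, not the authors' implementation: the `EventOrder` class, the `serialize` and `parse` helpers, and the order vocabulary (`start`, `mid`, `end`, `all`) are all hypothetical, and the LLM parsing step itself is not shown.

```python
import re
from dataclasses import dataclass

# Hypothetical order vocabulary; the abstract does not specify one.
ORDERS = ("start", "mid", "end", "all")


@dataclass(frozen=True)
class EventOrder:
    """One sound event together with its rough position in the clip."""
    event: str
    order: str

    def __post_init__(self):
        # Reject order labels outside the assumed vocabulary.
        if self.order not in ORDERS:
            raise ValueError(f"unknown order: {self.order!r}")


def serialize(pairs):
    """Render pairs as '<event & order>' tokens separated by spaces."""
    return " ".join(f"<{p.event} & {p.order}>" for p in pairs)


def parse(text):
    """Recover EventOrder pairs from a string produced by serialize()."""
    found = re.findall(r"<([^<>&]+?)\s*&\s*([^<>&]+?)>", text)
    return [EventOrder(event.strip(), order.strip()) for event, order in found]


if __name__ == "__main__":
    # Hand-written example of what a parsed caption might look like.
    pairs = [
        EventOrder("dog barking", "start"),
        EventOrder("bird chirping", "end"),
    ]
    print(serialize(pairs))
```

In a pipeline like the one the abstract describes, a string of this kind would be produced by the LLM from the free-form caption and passed to the structured-text encoder alongside the original text; the round trip through `parse` here only checks that the format is unambiguous.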
