CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech

Abstract

Recent advances in generative artificial intelligence have significantly transformed the field of style-captioned text-to-speech synthesis (CapTTS). However, adapting CapTTS to real-world applications remains challenging due to the lack of standardized, comprehensive datasets and limited research on downstream tasks built upon CapTTS. To address these gaps, we introduce CapSpeech, a new benchmark designed for a series of CapTTS-related tasks, including style-captioned text-to-speech synthesis with sound events (CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS (EmoCapTTS), and text-to-speech synthesis for chat agents (AgentTTS). CapSpeech comprises over 10 million machine-annotated audio-caption pairs and nearly 0.36 million human-annotated audio-caption pairs. In addition, we introduce two new datasets collected and recorded by a professional voice actor and experienced audio engineers, specifically for the AgentTTS and CapTTS-SE tasks. Alongside the datasets, we conduct comprehensive experiments using both autoregressive and non-autoregressive models on CapSpeech. Our results demonstrate high-fidelity and highly intelligible speech synthesis across a diverse range of speaking styles. To the best of our knowledge, CapSpeech is the largest available dataset offering comprehensive annotations for CapTTS-related tasks. The experiments and findings further provide valuable insights into the challenges of developing CapTTS systems.
