CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech
June 3, 2025
Authors: Helin Wang, Jiarui Hai, Dading Chong, Karan Thakkar, Tiantian Feng, Dongchao Yang, Junhyeok Lee, Laureano Moro-Velazquez, Jesus Villalba, Zengyi Qin, Shrikanth Narayanan, Mounya Elhilali, Najim Dehak
cs.AI
Abstract
Recent advancements in generative artificial intelligence have significantly
transformed the field of style-captioned text-to-speech synthesis (CapTTS).
However, adapting CapTTS to real-world applications remains challenging due to
the lack of standardized, comprehensive datasets and limited research on
downstream tasks built upon CapTTS. To address these gaps, we introduce
CapSpeech, a new benchmark designed for a series of CapTTS-related tasks,
including style-captioned text-to-speech synthesis with sound events
(CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS
(EmoCapTTS), and text-to-speech synthesis for chat agents (AgentTTS). CapSpeech
comprises over 10 million machine-annotated audio-caption pairs and nearly 0.36
million human-annotated audio-caption pairs. In addition, we introduce two new
datasets collected and recorded by a professional voice actor and experienced
audio engineers, specifically for the AgentTTS and CapTTS-SE tasks. Alongside
the datasets, we conduct comprehensive experiments using both autoregressive
and non-autoregressive models on CapSpeech. Our results demonstrate
high-fidelity and highly intelligible speech synthesis across a diverse range
of speaking styles. To the best of our knowledge, CapSpeech is the largest
available dataset offering comprehensive annotations for CapTTS-related tasks.
The experiments and findings further provide valuable insights into the
challenges of developing CapTTS systems.