CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech
June 3, 2025
Authors: Helin Wang, Jiarui Hai, Dading Chong, Karan Thakkar, Tiantian Feng, Dongchao Yang, Junhyeok Lee, Laureano Moro-Velazquez, Jesus Villalba, Zengyi Qin, Shrikanth Narayanan, Mounya Elhilali, Najim Dehak
cs.AI
Abstract
Recent advancements in generative artificial intelligence have significantly
transformed the field of style-captioned text-to-speech synthesis (CapTTS).
However, adapting CapTTS to real-world applications remains challenging due to
the lack of standardized, comprehensive datasets and limited research on
downstream tasks built upon CapTTS. To address these gaps, we introduce
CapSpeech, a new benchmark designed for a series of CapTTS-related tasks,
including style-captioned text-to-speech synthesis with sound events
(CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS
(EmoCapTTS), and text-to-speech synthesis for chat agents (AgentTTS). CapSpeech
comprises over 10 million machine-annotated audio-caption pairs and nearly 0.36
million human-annotated audio-caption pairs. In addition, we introduce two new
datasets collected and recorded by a professional voice actor and experienced
audio engineers, specifically for the AgentTTS and CapTTS-SE tasks. Alongside
the datasets, we conduct comprehensive experiments using both autoregressive
and non-autoregressive models on CapSpeech. Our results demonstrate
high-fidelity and highly intelligible speech synthesis across a diverse range
of speaking styles. To the best of our knowledge, CapSpeech is the largest
available dataset offering comprehensive annotations for CapTTS-related tasks.
The experiments and findings further provide valuable insights into the
challenges of developing CapTTS systems.
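To make the benchmark's structure concrete, the sketch below shows what a style-captioned audio-text pair for the listed tasks might look like. The field names, file paths, and captions are illustrative assumptions for exposition only, not the actual CapSpeech schema.

```python
from dataclasses import dataclass

# Hypothetical record for a style-captioned TTS pair. All field names
# and example captions are assumptions, not the real CapSpeech format.
@dataclass
class CaptionedPair:
    audio_path: str     # path to the speech recording
    transcript: str     # text that is spoken
    style_caption: str  # natural-language description of the speaking style
    task: str           # one of the CapTTS-related tasks

pairs = [
    CaptionedPair("a.wav", "The door creaked open.",
                  "A calm male voice; a creaking sound event occurs mid-utterance.",
                  "CapTTS-SE"),
    CaptionedPair("b.wav", "Welcome back!",
                  "A cheerful young female voice with a strong Scottish accent.",
                  "AccCapTTS"),
    CaptionedPair("c.wav", "I can't believe it.",
                  "A trembling voice conveying deep sadness.",
                  "EmoCapTTS"),
]

def by_task(data, task):
    """Select the subset of pairs annotated for one downstream task."""
    return [p for p in data if p.task == task]

print(len(by_task(pairs, "AccCapTTS")))  # prints 1
```

Organizing the pairs this way reflects the paper's framing: a single caption-annotated corpus from which task-specific subsets (CapTTS-SE, AccCapTTS, EmoCapTTS, AgentTTS) can be drawn.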