

NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations

August 6, 2025
Authors: Huan Liao, Qinke Ni, Yuancheng Wang, Yiheng Lu, Haoyue Zhan, Pengyuan Xie, Qiang Zhang, Zhizheng Wu
cs.AI

Abstract

Paralinguistic vocalizations, including non-verbal sounds like laughter and breathing as well as lexicalized interjections such as "uhm" and "oh", are integral to natural spoken communication. Despite their importance in conveying affect, intent, and interactional cues, such vocalizations remain largely overlooked in conventional automatic speech recognition (ASR) and text-to-speech (TTS) systems. We present NVSpeech, an integrated and scalable pipeline that bridges the recognition and synthesis of paralinguistic vocalizations, encompassing dataset construction, ASR modeling, and controllable TTS. (1) We introduce a manually annotated dataset of 48,430 human-spoken utterances covering 18 word-level paralinguistic categories. (2) We develop a paralinguistic-aware ASR model that treats paralinguistic cues as inline decodable tokens (e.g., "You're so funny [Laughter]"), enabling joint lexical and non-verbal transcription. This model is then used to automatically annotate a large corpus, yielding the first large-scale Chinese dataset with word-level alignment and paralinguistic cues: 174,179 utterances (573 hours). (3) We fine-tune zero-shot TTS models on both human- and auto-labeled data to enable explicit control over paralinguistic vocalizations, allowing context-aware insertion at arbitrary token positions for human-like speech synthesis. By unifying the recognition and generation of paralinguistic vocalizations, NVSpeech offers the first open, large-scale, word-level annotated pipeline for expressive speech modeling in Mandarin, integrating recognition and synthesis in a scalable and controllable manner. Dataset and audio demos are available at https://nvspeech170k.github.io/.
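To make the inline-token formulation concrete, the sketch below shows how a transcript carrying decodable paralinguistic tokens (e.g., "You're so funny [Laughter]") could be split into clean text plus word-level tag positions, and how a tag could be inserted at an arbitrary word position to build a controllable-TTS input. This is a minimal illustration, not the authors' code: the bracketed tag syntax follows the abstract's example, but the whitespace tokenization, helper names, and any tag name other than [Laughter] are assumptions (the actual pipeline targets Mandarin and uses its own word-level alignment).

```python
import re
from typing import List, Tuple

# Illustrative tag syntax; the paper defines 18 word-level categories,
# of which only "[Laughter]" is named in the abstract.
TAG_PATTERN = re.compile(r"\[(?P<tag>[A-Za-z]+)\]")

def parse_inline_tags(transcript: str) -> Tuple[str, List[Tuple[int, str]]]:
    """Split an ASR transcript with inline paralinguistic tokens into
    clean text and a list of (word_index, tag) pairs recording each
    tag's word-level position."""
    clean: List[str] = []
    tags: List[Tuple[int, str]] = []
    for word in transcript.split():
        match = TAG_PATTERN.fullmatch(word)
        if match:
            # The tag anchors at the position of the next lexical word.
            tags.append((len(clean), match.group("tag")))
        else:
            clean.append(word)
    return " ".join(clean), tags

def insert_tag(text: str, tag: str, word_index: int) -> str:
    """Insert a paralinguistic tag at an arbitrary word position,
    producing a tagged input string for a controllable TTS model."""
    words = text.split()
    words.insert(word_index, f"[{tag}]")
    return " ".join(words)

if __name__ == "__main__":
    text, tags = parse_inline_tags("You're so funny [Laughter]")
    print(text)  # -> "You're so funny"
    print(tags)  # -> [(3, 'Laughter')]
    # "Breathing" is a hypothetical tag name used here only for illustration.
    print(insert_tag("I did not expect that", "Breathing", 3))
    # -> "I did not [Breathing] expect that"
```

Representing paralinguistic cues as plain inline tokens keeps both directions of the pipeline symmetric: the same tagged-string format serves as ASR output and as TTS input, so auto-labeled transcripts can be fed to TTS fine-tuning without conversion.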