

NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations

August 6, 2025
Authors: Huan Liao, Qinke Ni, Yuancheng Wang, Yiheng Lu, Haoyue Zhan, Pengyuan Xie, Qiang Zhang, Zhizheng Wu
cs.AI

Abstract

Paralinguistic vocalizations, including non-verbal sounds such as laughter and breathing as well as lexicalized interjections such as "uhm" and "oh", are integral to natural spoken communication. Despite their importance in conveying affect, intent, and interactional signals, such cues remain largely overlooked in conventional automatic speech recognition (ASR) and text-to-speech (TTS) systems. We present NVSpeech, an integrated and scalable pipeline that bridges the recognition and synthesis of paralinguistic vocalizations, encompassing dataset construction, ASR modeling, and controllable TTS. (1) We introduce a manually annotated dataset of 48,430 human-spoken utterances covering 18 word-level paralinguistic categories. (2) We develop a paralinguistic-aware ASR model that treats paralinguistic cues as inline decodable tokens (e.g., "You're so funny [Laughter]"), enabling joint lexical and non-verbal transcription. This model is then used to automatically annotate a large corpus, the first large-scale Chinese dataset of its kind, comprising 174,179 utterances (573 hours) with word-level alignment and paralinguistic cues. (3) We fine-tune zero-shot TTS models on both the human- and auto-labeled data to enable explicit control over paralinguistic vocalizations, allowing context-aware insertion at arbitrary token positions for human-like speech synthesis. By unifying the recognition and generation of paralinguistic vocalizations, NVSpeech offers the first open, large-scale, word-level annotated pipeline for expressive speech modeling in Mandarin, scalable and controllable from recognition through synthesis. Dataset and audio demos are available at https://nvspeech170k.github.io/.
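To make the data convention in steps (2) and (3) concrete, here is a minimal Python sketch of the inline-token format. It assumes only what the abstract states: paralinguistic cues appear as bracketed tokens such as "[Laughter]" embedded in the transcript. The function names (parse_inline_transcript, insert_tag) and the character-offset event representation are illustrative assumptions, not the NVSpeech API; in the actual pipeline such tags are decoded directly by the ASR model and consumed by the TTS front end.

    import re

    # Assumed convention: paralinguistic cues are bracketed inline tokens,
    # e.g. "You're so funny [Laughter]" (from the abstract's example).
    TAG_PATTERN = re.compile(r"\[(?P<tag>[^\[\]]+)\]")

    def parse_inline_transcript(transcript):
        """Split a transcript into lexical text and (char_offset, tag) events."""
        events, parts, cursor, offset = [], [], 0, 0
        for match in TAG_PATTERN.finditer(transcript):
            chunk = transcript[cursor:match.start()]
            parts.append(chunk)
            offset += len(chunk)
            events.append((offset, match.group("tag")))
            cursor = match.end()
        parts.append(transcript[cursor:])
        return "".join(parts), events

    def insert_tag(text, position, tag):
        """Insert a paralinguistic tag at an arbitrary character position,
        mimicking context-aware insertion into controllable TTS input."""
        return text[:position] + "[" + tag + "]" + text[position:]

    lexical, events = parse_inline_transcript("You're so funny [Laughter]")
    print(lexical)  # "You're so funny " (tag stripped from the lexical stream)
    print(events)   # [(16, 'Laughter')]
    print(insert_tag("You're so funny", 15, "Laughter"))  # "You're so funny[Laughter]"

Representing cues as inline decodable tokens lets a single decoder emit words and non-verbal events in one pass, and the same bracketed convention doubles as the TTS control interface, which is what allows insertion at arbitrary token positions.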