NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations

Abstract

Paralinguistic vocalizations (including non-verbal sounds like laughter and breathing, as well as lexicalized interjections such as "uhm" and "oh") are integral to natural spoken communication. Despite their importance in conveying affect, intent, and interactional cues, such cues remain largely overlooked in conventional automatic speech recognition (ASR) and text-to-speech (TTS) systems. We present NVSpeech, an integrated and scalable pipeline that bridges the recognition and synthesis of paralinguistic vocalizations, encompassing dataset construction, ASR modeling, and controllable TTS. (1) We introduce a manually annotated dataset of 48,430 human-spoken utterances with 18 word-level paralinguistic categories. (2) We develop a paralinguistic-aware ASR model that treats paralinguistic cues as inline decodable tokens (e.g., "You're so funny [Laughter]"), enabling joint lexical and non-verbal transcription. This model is then used to automatically annotate a large corpus, yielding the first large-scale Chinese dataset of 174,179 utterances (573 hours) with word-level alignment and paralinguistic cues. (3) We finetune zero-shot TTS models on both human- and auto-labeled data to enable explicit control over paralinguistic vocalizations, allowing context-aware insertion at arbitrary token positions for human-like speech synthesis. By unifying the recognition and generation of paralinguistic vocalizations in a scalable and controllable manner, NVSpeech offers the first open, large-scale, word-level annotated pipeline for expressive speech modeling in Mandarin. Dataset and audio demos are available at https://nvspeech170k.github.io/.
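The inline-token idea can be illustrated with a minimal sketch. The bracket syntax follows the abstract's own example ("You're so funny [Laughter]"); everything else here is an assumption made for illustration, not the NVSpeech implementation: the helper names `split_transcript` and `insert_tag` are hypothetical, the "Breathing" label spelling is assumed, and whitespace splitting stands in for whatever tokenization the actual models use (Mandarin text would need a different tokenizer).

```python
import re

# Matches an inline tag such as "[Laughter]". The bracket syntax follows the
# abstract's example; the exact tag inventory is an assumption here.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_transcript(text):
    """Separate lexical words from inline paralinguistic tags.

    Returns (words, events), where each event is (tag, word_index) and
    word_index is the number of words that precede the tag.
    """
    words = []
    events = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        # Words between the previous tag (or start) and this tag.
        words.extend(text[pos:match.start()].split())
        events.append((match.group(1), len(words)))
        pos = match.end()
    # Trailing words after the last tag.
    words.extend(text[pos:].split())
    return words, events


def insert_tag(words, tag, word_index):
    """Return a transcript string with `tag` inserted before words[word_index]."""
    out = words[:word_index] + [f"[{tag}]"] + words[word_index:]
    return " ".join(out)


if __name__ == "__main__":
    words, events = split_transcript("You're so funny [Laughter]")
    print(words, events)
    print(insert_tag(words, "Breathing", 0))
```

Keeping tags in the same sequence as the words is what lets a single decoder emit both, and the same representation lets a TTS front end accept a tag at any word position.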
