NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations

Abstract

Paralinguistic vocalizations, including non-verbal sounds such as laughter and breathing as well as lexicalized interjections such as "uhm" and "oh", are integral to natural spoken communication. Despite their importance in conveying affect, intent, and interactional cues, they remain largely overlooked in conventional automatic speech recognition (ASR) and text-to-speech (TTS) systems. We present NVSpeech, an integrated and scalable pipeline that bridges the recognition and synthesis of paralinguistic vocalizations, encompassing dataset construction, ASR modeling, and controllable TTS. (1) We introduce a manually annotated dataset of 48,430 human-spoken utterances with 18 word-level paralinguistic categories. (2) We develop a paralinguistic-aware ASR model that treats paralinguistic cues as inline decodable tokens (e.g., "You're so funny [Laughter]"), enabling joint lexical and non-verbal transcription. This model is then used to automatically annotate a large corpus, yielding the first large-scale Chinese dataset of 174,179 utterances (573 hours) with word-level alignment and paralinguistic cues. (3) We finetune zero-shot TTS models on both human- and auto-labeled data to enable explicit control over paralinguistic vocalizations, allowing context-aware insertion at arbitrary token positions for human-like speech synthesis. By unifying the recognition and generation of paralinguistic vocalizations, NVSpeech offers the first open, large-scale, word-level annotated pipeline for expressive speech modeling in Mandarin, integrating recognition and synthesis in a scalable and controllable manner. Dataset and audio demos are available at https://nvspeech170k.github.io/.
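As a rough illustration of the inline-token idea in point (2), the sketch below separates a tagged transcript into its lexical text and a list of paralinguistic events. It is not the NVSpeech implementation: the function name `split_transcript`, the regular expression, and the whitespace-based word index are all assumptions made for this example, and only the bracketed `[Laughter]` form is taken from the abstract.

```python
import re

# Assumed tag syntax: a bracketed label such as "[Laughter]" (form taken from
# the abstract's example; the exact label inventory is not specified here).
TAG = re.compile(r"\[([A-Za-z][A-Za-z -]*)\]")


def split_transcript(tagged):
    """Split an inline-tagged transcript into (plain_text, events).

    Each event is (word_index, label), where word_index is the number of
    whitespace-separated words that precede the tag. Whitespace splitting is
    a simplification; Mandarin text would need a proper tokenizer.
    """
    words = []
    events = []
    pos = 0
    for match in TAG.finditer(tagged):
        # Collect the lexical words that appear before this tag.
        words.extend(tagged[pos:match.start()].split())
        events.append((len(words), match.group(1)))
        pos = match.end()
    # Collect any words that follow the last tag.
    words.extend(tagged[pos:].split())
    return " ".join(words), events


text, events = split_transcript("You're so funny [Laughter]")
print(text)    # You're so funny
print(events)  # [(3, 'Laughter')]
```

Keeping the events as positions relative to the lexical words mirrors the idea that a vocalization can be placed at an arbitrary token position, which is also what a controllable TTS front end would need as input.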

