Speaker Identity in Non-Verbal Vocalizations: Conditional Distillation and Mixture of Experts Approach

June 19, 2026
Authors: Tzu-Chieh Wei, Yi-Cheng Lin, Huang-Cheng Chou, Kuan-Yu Chen, Hsin-Yen Sung, Shrikanth Narayanan, Hung-yi Lee
cs.AI

Abstract

As expressive text-to-speech (TTS) and voice conversion (VC) systems increasingly generate non-verbal vocalizations (NVVs) to enhance naturalness, reliable speaker verification (SV) becomes essential for objectively assessing identity consistency across verbal and non-verbal segments. Yet current SV systems generalize poorly to NVVs, and fine-tuning on NVV data causes catastrophic forgetting of speech performance. We present the first systematic study across 10 NVV types and propose a framework that combines frozen Data2Vec self-supervised features with ECAPA-TDNN, enhanced by a Mixture of Experts (MoE) module with learned domain-aware routing. A conditional distillation loss, applied to speech inputs via a pretrained teacher, retains speech-to-speech accuracy, while a contrastive loss bridges the speech-NVV domain gap. Relative to a pretrained baseline, our method reduces speech-NVV equal error rate (EER) from 38.93% to 22.66%, and distillation lowers speech EER from 13.17% to 9.24%.
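
To make two ideas from the abstract concrete, the following is a minimal, dependency-free sketch rather than the authors' implementation. It illustrates (1) a distillation term that is "conditional" in the sense that it is computed only for speech inputs, and (2) how an equal error rate can be estimated from verification scores. The use of cosine distance between student and teacher embeddings, the function names, and the threshold-sweep EER estimate are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def conditional_distillation_loss(student, teacher, is_speech):
    """Mean (1 - cosine) between student and teacher embeddings.

    Only items flagged as speech contribute; NVV items are skipped so the
    student is free to move away from the teacher on them.
    Assumption: cosine distance stands in for the unspecified loss form.
    """
    terms = [
        1.0 - cosine(s, t)
        for s, t, speech in zip(student, teacher, is_speech)
        if speech
    ]
    return sum(terms) / len(terms) if terms else 0.0


def equal_error_rate(target_scores, nontarget_scores):
    """Estimate EER by sweeping every observed score as a threshold.

    A trial is accepted when score >= threshold. The estimate is the mean
    of false-accept and false-reject rates at the threshold where the two
    rates are closest.
    """
    best_gap, eer = float("inf"), 1.0
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Toy embeddings: the second item is treated as an NVV and is masked out.
    student = [[1.0, 0.0], [0.0, 1.0]]
    teacher = [[1.0, 0.0], [1.0, 0.0]]
    print(conditional_distillation_loss(student, teacher, [True, False]))

    # Toy same-speaker and different-speaker scores.
    print(equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]))
```

In a full training setup, a term like this would presumably be added to the contrastive loss that the abstract mentions. The weighting between the two is not stated in the abstract and is left out here.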