Speaker Identity in Non-Verbal Vocalizations: A Conditional Distillation and Mixture-of-Experts Approach

Abstract

As expressive text-to-speech (TTS) and voice conversion (VC) systems increasingly generate non-verbal vocalizations (NVVs) to enhance naturalness, reliable speaker verification (SV) becomes essential for objectively assessing identity consistency across verbal and non-verbal segments. Yet current SV systems generalize poorly to NVVs, and fine-tuning on NVV data causes catastrophic forgetting of speech performance. We present the first systematic study across 10 NVV types and propose a framework that combines frozen Data2Vec self-supervised features with ECAPA-TDNN, enhanced by a Mixture of Experts (MoE) module with learned domain-aware routing. A conditional distillation loss, applied to speech inputs through a pretrained teacher, preserves speech-to-speech accuracy, while a contrastive loss bridges the speech-NVV domain gap. Relative to a pretrained baseline, our method reduces speech-NVV equal error rate (EER) from 38.93% to 22.66%, and distillation lowers speech EER from 13.17% to 9.24%.
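The conditional distillation idea can be illustrated with a minimal sketch: the student is pulled toward the teacher's embedding only for inputs flagged as speech, so NVV inputs stay free to adapt. The abstract does not give the exact form of the loss, so the cosine-distance formulation and the names `conditional_distillation_loss`, `student`, `teacher`, and `is_speech` are assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def conditional_distillation_loss(
    student: Sequence[Vector],
    teacher: Sequence[Vector],
    is_speech: Sequence[bool],
) -> float:
    """Mean cosine distance between student and teacher embeddings,
    computed over speech samples only.

    Assumption: cosine distance (1 - cosine similarity) stands in for the
    distillation objective; the source does not specify the loss form.
    NVV samples (is_speech == False) contribute nothing to this term.
    """
    distances = [
        1.0 - cosine_similarity(s, t)
        for s, t, flag in zip(student, teacher, is_speech)
        if flag
    ]
    if not distances:
        # A batch with no speech samples has no distillation term.
        return 0.0
    return sum(distances) / len(distances)


if __name__ == "__main__":
    # Hypothetical 2-D embeddings: the first sample is speech, the second is an NVV.
    student = [[1.0, 0.0], [0.0, 1.0]]
    teacher = [[1.0, 0.0], [1.0, 0.0]]
    print(conditional_distillation_loss(student, teacher, [True, False]))
```

In a full training objective, this term would presumably be added to the speech-NVV contrastive loss with some weighting; that weighting is not stated in the abstract and is left out here.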