Speaker Identity in Non-Verbal Vocalizations: A Conditional Distillation and Mixture-of-Experts Approach

Abstract

As expressive text-to-speech (TTS) and voice conversion (VC) systems increasingly generate non-verbal vocalizations (NVVs) to enhance naturalness, reliable speaker verification (SV) becomes essential for objectively assessing identity consistency across both verbal and non-verbal segments. Yet current SV systems generalize poorly to NVVs, and fine-tuning on NVV data causes catastrophic forgetting of speech performance. We present the first systematic study across 10 NVV types and propose a framework that combines frozen Data2Vec self-supervised features with ECAPA-TDNN, enhanced by a Mixture of Experts (MoE) module with learned domain-aware routing. A conditional distillation loss, applied to speech inputs via a pretrained teacher, preserves speech-to-speech accuracy, while a contrastive loss bridges the speech-NVV domain gap. Relative to a pretrained baseline, our method reduces the speech-NVV equal error rate (EER) from 38.93% to 22.66%, and distillation improves speech EER from 13.17% to 9.24%.
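To make the "conditional" aspect of the distillation term concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the distillation distance is one minus the cosine similarity between student and teacher embeddings, and that a per-sample flag marks speech inputs; the abstract specifies neither, and the names `cosine_similarity`, `conditional_distillation_loss`, and `is_speech` are hypothetical. Only the gating itself (the loss is applied to speech inputs and skipped for NVV inputs) comes from the text.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def conditional_distillation_loss(
    student: Sequence[Sequence[float]],
    teacher: Sequence[Sequence[float]],
    is_speech: Sequence[bool],
) -> float:
    """Mean cosine distance between student and teacher embeddings,
    computed over speech samples only (hypothetical loss form).

    NVV samples (is_speech == False) contribute nothing, so the student
    is free to move away from the teacher on those inputs.
    """
    distances = [
        1.0 - cosine_similarity(s, t)
        for s, t, flag in zip(student, teacher, is_speech)
        if flag
    ]
    if not distances:
        # No speech samples in the batch: the distillation term vanishes.
        return 0.0
    return sum(distances) / len(distances)


# Toy batch: the first student embedding matches the teacher, the second is orthogonal to it.
student_batch = [[1.0, 0.0], [0.0, 1.0]]
teacher_batch = [[1.0, 0.0], [1.0, 0.0]]
loss_speech_only_first = conditional_distillation_loss(
    student_batch, teacher_batch, [True, False]
)
loss_both_speech = conditional_distillation_loss(
    student_batch, teacher_batch, [True, True]
)
```

In this toy batch the mismatched second sample is penalized only when it is flagged as speech. This is the mechanism by which a gated term can anchor speech embeddings to the teacher while leaving NVV embeddings to be shaped by the contrastive loss.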