Speaker Identification in Non-Verbal Vocalizations: A Conditional Distillation and Mixture of Experts Approach

Abstract

As expressive text-to-speech (TTS) and voice conversion (VC) systems increasingly generate non-verbal vocalizations (NVVs) to enhance naturalness, reliable speaker verification (SV) has become essential for objectively assessing identity consistency across both verbal and non-verbal segments. Yet current SV systems generalize poorly to NVVs, and fine-tuning on NVV data causes catastrophic forgetting of speech performance. We present the first systematic study across 10 NVV types and propose a framework that combines frozen Data2Vec self-supervised features with ECAPA-TDNN, enhanced by a Mixture of Experts (MoE) module with learned domain-aware routing. A conditional distillation loss, applied only to speech inputs and supervised by a pretrained teacher, preserves speech-to-speech accuracy, while a contrastive loss bridges the speech-NVV domain gap. Relative to a pretrained baseline, our method reduces the speech-NVV equal error rate (EER) from 38.93% to 22.66% and, through distillation, improves the speech EER from 13.17% to 9.24%.
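
To make the two training objectives concrete, the following is a minimal NumPy sketch, not the implementation used in this work. It assumes the distillation term is a cosine distance between student and teacher embeddings averaged over speech inputs only, and that the contrastive term is an InfoNCE-style loss over a batch in which row i of the speech embeddings and row i of the NVV embeddings come from the same speaker. The function names, the temperature value, and the exact loss forms are illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps guards against zero vectors)."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def conditional_distillation_loss(student, teacher, is_speech):
    """Mean cosine distance to the teacher, computed over speech rows only.

    Assumption: NVV rows are masked out, so the teacher constrains the
    student only where the teacher is trusted (speech inputs).
    """
    s = l2_normalize(student)
    t = l2_normalize(teacher)
    cos_dist = 1.0 - np.sum(s * t, axis=-1)
    mask = np.asarray(is_speech, dtype=float)
    n_speech = mask.sum()
    if n_speech == 0:
        return 0.0  # no speech in the batch: the term contributes nothing
    return float((cos_dist * mask).sum() / n_speech)


def contrastive_loss(speech_emb, nvv_emb, temperature=0.1):
    """InfoNCE-style loss: row i of speech_emb should match row i of nvv_emb.

    Assumption: each row index corresponds to a distinct speaker.
    """
    a = l2_normalize(speech_emb)
    b = l2_normalize(nvv_emb)
    logits = a @ b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

Under these assumptions, a total objective could be formed as the contrastive term plus a weighted distillation term; the weighting is a hypothetical hyperparameter and is not specified here.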