SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline

Abstract

Target Speech Extraction (TSE) aims to isolate a target speaker's voice from a mixture of multiple speakers by leveraging speaker-specific cues, typically provided as auxiliary audio (also known as cue audio). Recent advances in TSE have primarily relied on discriminative models, which offer high perceptual quality but often introduce unwanted artifacts, reduce naturalness, and are sensitive to discrepancies between training and testing environments. Generative models for TSE, on the other hand, lag behind in perceptual quality and intelligibility. To address these challenges, we present SoloSpeech, a novel cascaded generative pipeline that integrates compression, extraction, reconstruction, and correction processes. SoloSpeech features a speaker-embedding-free target extractor that takes its conditional information from the cue audio's latent space and aligns it with the mixture audio's latent space to prevent mismatches. Evaluated on the widely used Libri2Mix dataset, SoloSpeech achieves new state-of-the-art intelligibility and quality in target speech extraction and speech separation tasks, while demonstrating exceptional generalization on out-of-domain data and real-world scenarios.