SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline
May 25, 2025
作者: Helin Wang, Jiarui Hai, Dongchao Yang, Chen Chen, Kai Li, Junyi Peng, Thomas Thebaud, Laureano Moro Velazquez, Jesus Villalba, Najim Dehak
cs.AI
Abstract
Target Speech Extraction (TSE) aims to isolate a target speaker's voice from
a mixture of multiple speakers by leveraging speaker-specific cues, typically
provided as auxiliary audio (a.k.a. cue audio). Although recent advancements in
TSE have primarily employed discriminative models that offer high perceptual
quality, these models often introduce unwanted artifacts, reduce naturalness,
and are sensitive to discrepancies between training and testing environments.
On the other hand, generative models for TSE lag in perceptual quality and
intelligibility. To address these challenges, we present SoloSpeech, a novel
cascaded generative pipeline that integrates compression, extraction,
reconstruction, and correction processes. SoloSpeech features a
speaker-embedding-free target extractor that utilizes conditional information
from the cue audio's latent space, aligning it with the mixture audio's latent
space to prevent mismatches. Evaluated on the widely-used Libri2Mix dataset,
SoloSpeech achieves new state-of-the-art intelligibility and quality in
target speech extraction and speech separation tasks while demonstrating
exceptional generalization to out-of-domain data and real-world scenarios.
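
The abstract is concrete enough about the data flow to sketch: audio is compressed into a latent space, the target is extracted there conditioned on cue latents (rather than a pooled speaker embedding), and the result is reconstructed and corrected. Below is a minimal, hypothetical PyTorch sketch of that four-stage cascade; all class and module names (`SoloSpeechPipeline`, `IdentityExtractor`, the placeholder stages) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SoloSpeechPipeline(nn.Module):
    """Toy sketch of the cascade described in the abstract:
    compression -> extraction -> reconstruction -> correction.
    Stage internals are placeholders, not SoloSpeech's architecture."""

    def __init__(self, compressor, extractor, reconstructor, corrector):
        super().__init__()
        self.compressor = compressor        # waveform -> latent sequence
        self.extractor = extractor          # target extraction in latent space
        self.reconstructor = reconstructor  # latent sequence -> waveform
        self.corrector = corrector          # refines the reconstructed waveform

    def forward(self, mixture, cue):
        # Encode both signals with the same compressor, so the cue
        # conditioning lives in the same latent space as the mixture
        # (the alignment the abstract says prevents mismatches).
        z_mix = self.compressor(mixture)
        z_cue = self.compressor(cue)
        # Speaker-embedding-free: condition directly on cue latents
        # instead of a pooled speaker embedding.
        z_target = self.extractor(z_mix, z_cue)
        return self.corrector(self.reconstructor(z_target))


class IdentityExtractor(nn.Module):
    """Placeholder extractor that ignores the cue (stub only)."""

    def forward(self, z_mix, z_cue):
        return z_mix


# Toy end-to-end pass with identity placeholder stages.
pipe = SoloSpeechPipeline(nn.Identity(), IdentityExtractor(),
                          nn.Identity(), nn.Identity())
out = pipe(torch.randn(1, 16000), torch.randn(1, 16000))
print(out.shape)  # torch.Size([1, 16000])
```

The point of the sketch is the shared compressor: because the cue and the mixture pass through the same encoder, the extractor's conditioning signal and its input occupy one latent space, which is how the abstract frames avoiding train/test mismatch without a separate speaker-embedding model.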