SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline
May 25, 2025
Authors: Helin Wang, Jiarui Hai, Dongchao Yang, Chen Chen, Kai Li, Junyi Peng, Thomas Thebaud, Laureano Moro Velazquez, Jesus Villalba, Najim Dehak
cs.AI
Abstract
Target Speech Extraction (TSE) aims to isolate a target speaker's voice from
a mixture of multiple speakers by leveraging speaker-specific cues, typically
provided as auxiliary audio (a.k.a. cue audio). Although recent advancements in
TSE have primarily employed discriminative models that offer high perceptual
quality, these models often introduce unwanted artifacts, reduce naturalness,
and are sensitive to discrepancies between training and testing environments.
On the other hand, generative models for TSE lag in perceptual quality and
intelligibility. To address these challenges, we present SoloSpeech, a novel
cascaded generative pipeline that integrates compression, extraction,
reconstruction, and correction processes. SoloSpeech features a
speaker-embedding-free target extractor that utilizes conditional information
from the cue audio's latent space, aligning it with the mixture audio's latent
space to prevent mismatches. Evaluated on the widely used Libri2Mix dataset,
SoloSpeech achieves new state-of-the-art intelligibility and quality in
target speech extraction and speech separation tasks while demonstrating
exceptional generalization on out-of-domain data and in real-world scenarios.
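
To make the cascaded flow concrete, below is a minimal, hypothetical PyTorch sketch of how a compression → extraction → reconstruction → correction pipeline could be wired together. Every class, dimension, and interface here (Compressor, Extractor, Corrector, solospeech_like_pipeline) is an illustrative assumption, not the authors' implementation. In particular, the cross-attention module merely stands in for the paper's speaker-embedding-free conditioning: the mixture and cue audio are encoded with the same compressor so their latents share one space, and the extractor conditions on the cue latent directly rather than on a speaker embedding.

```python
# Structural sketch only -- all modules and shapes are hypothetical
# placeholders, not the SoloSpeech implementation.
import torch
import torch.nn as nn

class Compressor(nn.Module):
    """Hypothetical compressor: waveform <-> latent sequence."""
    def __init__(self, latent_dim=64, stride=320):
        super().__init__()
        self.encode = nn.Conv1d(1, latent_dim, kernel_size=stride, stride=stride)
        self.decode = nn.ConvTranspose1d(latent_dim, 1, kernel_size=stride, stride=stride)

class Extractor(nn.Module):
    """Hypothetical speaker-embedding-free extractor: conditions the
    mixture latent directly on the cue-audio latent, as the abstract
    describes, instead of using a separate speaker-embedding network."""
    def __init__(self, latent_dim=64, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(latent_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(latent_dim, latent_dim)

    def forward(self, mix_latent, cue_latent):
        # Attend from mixture frames (query) to cue frames (key/value),
        # aligning the two latent spaces.
        attended, _ = self.cross_attn(mix_latent, cue_latent, cue_latent)
        return self.proj(mix_latent + attended)

class Corrector(nn.Module):
    """Hypothetical corrector: residual refinement of the reconstruction."""
    def __init__(self):
        super().__init__()
        self.refine = nn.Conv1d(1, 1, kernel_size=9, padding=4)

    def forward(self, wav):
        return wav + self.refine(wav)

def solospeech_like_pipeline(mixture, cue):
    # mixture, cue: (batch, 1, samples)
    comp, ext, corr = Compressor(), Extractor(), Corrector()
    mix_latent = comp.encode(mixture).transpose(1, 2)  # compression
    cue_latent = comp.encode(cue).transpose(1, 2)      # same latent space
    target_latent = ext(mix_latent, cue_latent)        # extraction
    wav = comp.decode(target_latent.transpose(1, 2))   # reconstruction
    return corr(wav)                                   # correction

if __name__ == "__main__":
    mix, cue = torch.randn(1, 1, 16000), torch.randn(1, 1, 16000)
    print(solospeech_like_pipeline(mix, cue).shape)  # torch.Size([1, 1, 16000])
```

Encoding the cue with the same compressor as the mixture is the point of the design sketched here: because both latents come from one codec, the extractor's conditioning signal lives in the space it operates on, which is how the abstract frames the mismatch problem that speaker-embedding-based conditioning can introduce.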