SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline

Abstract

Target Speech Extraction (TSE) aims to isolate a target speaker's voice from a mixture of multiple speakers by leveraging speaker-specific cues, typically provided as auxiliary audio (also known as cue audio). Recent advances in TSE have primarily employed discriminative models, which offer high perceptual quality but often introduce unwanted artifacts, reduce naturalness, and are sensitive to discrepancies between training and testing environments. Generative models for TSE, on the other hand, lag in perceptual quality and intelligibility. To address these challenges, we present SoloSpeech, a novel cascaded generative pipeline that integrates compression, extraction, reconstruction, and correction processes. SoloSpeech features a speaker-embedding-free target extractor that uses conditional information from the cue audio's latent space, aligning it with the mixture audio's latent space to prevent mismatches. Evaluated on the widely used Libri2Mix dataset, SoloSpeech achieves new state-of-the-art intelligibility and quality on target speech extraction and speech separation tasks, while demonstrating exceptional generalization to out-of-domain data and real-world scenarios.
