Advances in Speech Separation: Techniques, Challenges, and Future Trends

Abstract

The field of speech separation, which addresses the "cocktail party problem", has seen revolutionary advances with the advent of deep neural networks (DNNs). Speech separation improves clarity in complex acoustic environments and serves as a crucial pre-processing step for speech recognition and speaker recognition. However, the current literature tends to focus narrowly on specific architectures or isolated approaches, resulting in a fragmented understanding of the field. This survey addresses that gap with a systematic examination of DNN-based speech separation techniques. Our work differentiates itself in four ways: (I) Comprehensive perspective: we systematically investigate learning paradigms, separation scenarios with known and unknown speakers, a comparative analysis of supervised, self-supervised, and unsupervised frameworks, and architectural components from encoders to estimation strategies. (II) Timeliness: we cover cutting-edge developments, giving readers access to current innovations and benchmarks. (III) Unique insights: beyond summarization, we evaluate technological trajectories, identify emerging patterns, and highlight promising directions, including domain-robust frameworks, efficient architectures, multimodal integration, and novel self-supervised paradigms. (IV) Fair evaluation: we provide quantitative evaluations on standard datasets, revealing the true capabilities and limitations of different methods. This comprehensive survey serves as an accessible reference for both experienced researchers and newcomers navigating the complex landscape of speech separation.