One Model, Many Latencies: Universal Speech Enhancement for Diverse Real-Time Applications

Abstract

Different real-time speech applications impose distinct latency budgets, so a separately trained enhancement model is often required for each scenario. In this paper, we propose a one-for-all, real-time universal speech enhancement model that provides explicit control over both algorithmic and computational latency. Algorithmic latency is adjusted through a configurable number of look-ahead frames. To avoid the inefficient learning caused by varying padding configurations, we introduce parallel convolutional layers corresponding to the different look-ahead settings. Computational latency is controlled through an early-exit mechanism that enables inference at different network depths. To narrow the performance gap between specialized and flexible models, we propose a two-stage training strategy with a shared-to-multiple decoder transition. Overall, the proposed framework allows a single model to be deployed across diverse latency budgets without retraining separate models.
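
As a rough illustration of the two latency controls described above, the following toy sketch in pure Python shows how the padding split of a 1-D convolution over frames determines how many future frames each output sees, and how an early exit truncates the layer stack. It is not the proposed architecture: the names `conv1d_with_lookahead`, `FlexibleLayer`, and `run_with_early_exit` are hypothetical, the kernels are fixed toy values, and the choices to keep one kernel per look-ahead setting and to apply look-ahead only in the first layer are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def conv1d_with_lookahead(frames: Sequence[float],
                          kernel: Sequence[float],
                          lookahead: int) -> List[float]:
    """Correlate `kernel` with `frames`, letting each output see `lookahead` future frames.

    The k - 1 zero-padding frames are split as (k - 1 - lookahead) on the left
    and `lookahead` on the right, so output t depends on
    frames[t - (k - 1 - lookahead)] .. frames[t + lookahead].
    """
    k = len(kernel)
    if not 0 <= lookahead < k:
        raise ValueError("lookahead must satisfy 0 <= lookahead < kernel size")
    left = k - 1 - lookahead
    padded = [0.0] * left + list(frames) + [0.0] * lookahead
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(frames))]


class FlexibleLayer:
    """Hypothetical stand-in for parallel convolutions: one kernel per look-ahead setting."""

    def __init__(self, kernels: Dict[int, Sequence[float]]) -> None:
        self.kernels = dict(kernels)

    def __call__(self, frames: Sequence[float], lookahead: int) -> List[float]:
        # Select the branch that matches the requested look-ahead.
        return conv1d_with_lookahead(frames, self.kernels[lookahead], lookahead)


def run_with_early_exit(frames: Sequence[float],
                        layers: Sequence[FlexibleLayer],
                        lookahead: int,
                        exit_depth: int) -> List[float]:
    """Run only the first `exit_depth` layers (assumed early-exit behaviour).

    Assumption: only the first layer uses look-ahead; later layers are causal,
    so the total look-ahead of the stack stays at `lookahead` frames.
    """
    if not 1 <= exit_depth <= len(layers):
        raise ValueError("exit_depth must be between 1 and the number of layers")
    out = list(frames)
    for index, layer in enumerate(layers[:exit_depth]):
        out = layer(out, lookahead if index == 0 else 0)
    return out


if __name__ == "__main__":
    impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
    stack = [FlexibleLayer({0: [1.0, 1.0, 1.0], 1: [1.0, 1.0, 1.0]}),
             FlexibleLayer({0: [1.0, 1.0, 1.0]})]
    for la in (0, 1):
        for depth in (1, 2):
            print(la, depth, run_with_early_exit(impulse, stack, la, depth))
```

With an impulse at frame 2, the zero look-ahead branch responds only from frame 2 onward, whereas the one-frame look-ahead branch already responds at frame 1. Each look-ahead frame therefore costs one additional frame of waiting at inference time, while a smaller exit depth reduces the amount of computation per frame.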