StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion

Abstract

Recent advances in language models (LMs) have delivered impressive zero-shot voice conversion (VC) performance. However, existing LM-based VC models usually convert source semantics to acoustic features offline, which requires the complete source speech and limits their deployment in real-time applications. In this paper, we introduce StreamVoice, a novel streaming LM-based model for zero-shot VC that enables real-time conversion given arbitrary speaker prompts and source speech. Specifically, to enable streaming, StreamVoice employs a fully causal context-aware LM with a temporal-independent acoustic predictor and alternately processes semantic and acoustic features at each autoregressive time step, which eliminates the dependence on complete source speech. To address the potential performance degradation caused by incomplete context in streaming processing, we enhance the context-awareness of the LM through two strategies: 1) teacher-guided context foresight, in which a teacher model summarizes the present and future semantic context during training to guide the model's forecasting of missing context; and 2) a semantic masking strategy, which promotes acoustic prediction from preceding corrupted semantic and acoustic input and thereby enhances context-learning ability. Notably, StreamVoice is the first LM-based streaming zero-shot VC model without any future look-ahead. Experimental results demonstrate that StreamVoice achieves streaming conversion while maintaining zero-shot performance comparable to non-streaming VC systems.
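
The alternating, fully causal processing described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the `MASK` id, the masking ratio, and the `predict_acoustic` callback are all hypothetical stand-ins chosen for illustration, and real semantic and acoustic features would be model outputs rather than small integers. Teacher-guided context foresight is a training-time objective and is not shown here.

```python
import random
from typing import Callable, Iterable, Iterator, List, Tuple

Token = Tuple[str, int]  # ("sem" | "ac", token id); hypothetical representation
MASK = -1  # hypothetical id marking a corrupted (masked) semantic token


def interleave(semantic: List[int], acoustic: List[int]) -> List[Token]:
    """Build the alternating sequence s_1, a_1, s_2, a_2, ... from aligned frames."""
    if len(semantic) != len(acoustic):
        raise ValueError("semantic and acoustic streams must be frame-aligned")
    sequence: List[Token] = []
    for s, a in zip(semantic, acoustic):
        sequence.append(("sem", s))
        sequence.append(("ac", a))
    return sequence


def mask_semantic(semantic: List[int], ratio: float, rng: random.Random) -> List[int]:
    """Replace a random fraction of semantic tokens with MASK (assumed corruption scheme)."""
    n_mask = int(len(semantic) * ratio)
    masked_positions = set(rng.sample(range(len(semantic)), n_mask))
    return [MASK if i in masked_positions else s for i, s in enumerate(semantic)]


def stream_convert(
    semantic_stream: Iterable[int],
    predict_acoustic: Callable[[List[Token]], int],
) -> Iterator[int]:
    """Emit one acoustic token per incoming semantic token, with no look-ahead."""
    history: List[Token] = []
    for s in semantic_stream:
        history.append(("sem", s))
        a = predict_acoustic(history)  # sees only frames up to the current one
        history.append(("ac", a))
        yield a


if __name__ == "__main__":
    # Toy predictor standing in for the LM plus acoustic predictor.
    toy_predictor = lambda history: history[-1][1] * 10
    print(list(stream_convert([1, 2, 3], toy_predictor)))
    print(mask_semantic([1, 2, 3, 4], 0.5, random.Random(0)))
```

Because `stream_convert` consumes the source one frame at a time and only ever appends to `history`, each output depends solely on past and present input, which is the property that removes the need for the complete source utterance.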
