MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance

Abstract

Spoken dialogue systems often rely on cascaded pipelines that transcribe, process, and resynthesize speech. While effective, this design discards paralinguistic cues and limits expressivity. Recent end-to-end methods reduce latency and better preserve these cues, yet they still rely on text intermediates, which creates a fundamental bottleneck. We present MOSS-Speech, a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our approach combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning and knowledge of pretrained text LLMs while adding native speech capabilities. Experiments show that our model achieves state-of-the-art results in spoken question answering and delivers speech-to-speech performance comparable to that of existing text-guided systems, while maintaining competitive text performance. By narrowing the gap between text-guided and direct speech generation, our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
