MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance
October 1, 2025
Authors: Xingjian Zhao, Zhe Xu, Qinyuan Cheng, Zhaoye Fei, Luozhijie Jin, Yang Wang, Hanfu Chen, Yaozhou Jiang, Qinghui Gao, Ke Chen, Ruixiao Li, Mingshu Chen, Ruiming Wang, Wenbo Zhang, Yiyang Zhang, Donghua Yu, Yang Gao, Xiaogui Yang, Yitian Gong, Yuanfan Xu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu
cs.AI
Abstract
Spoken dialogue systems often rely on cascaded pipelines that transcribe,
process, and resynthesize speech. While effective, this design discards
paralinguistic cues and limits expressivity. Recent end-to-end methods reduce
latency and better preserve these cues, yet still rely on text intermediates,
creating a fundamental bottleneck. We present MOSS-Speech, a true
speech-to-speech large language model that directly understands and generates
speech without relying on text guidance. Our approach combines a modality-based
layer-splitting architecture with a frozen pre-training strategy, preserving
the reasoning and knowledge of pretrained text LLMs while adding native speech
capabilities. Experiments show that our model achieves state-of-the-art
results in spoken question answering and delivers speech-to-speech
performance comparable to that of existing text-guided systems, while still
maintaining competitive text performance. By narrowing the gap between text-guided and
direct speech generation, our work establishes a new paradigm for expressive
and efficient end-to-end speech interaction.
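
The abstract names two mechanisms: a modality-based layer-splitting architecture and a frozen pre-training strategy. Below is a minimal PyTorch sketch of one way these could fit together; it is an illustrative assumption, not the paper's released implementation, and every name in it (ModalitySplitBlock, ffn_text, ffn_speech, the is_speech mask) is hypothetical. The idea shown: text tokens flow through the frozen feed-forward weights of the pretrained text LLM, while speech tokens are routed through a parallel, trainable feed-forward path, so text knowledge is preserved while native speech capability is added.

```python
import torch
import torch.nn as nn

class ModalitySplitBlock(nn.Module):
    """Hypothetical transformer block with a modality-split feed-forward path:
    text tokens reuse the frozen pretrained FFN, speech tokens use a newly
    initialized, trainable FFN. Attention is shared across modalities."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Pretrained text FFN, frozen to preserve the text LLM's knowledge.
        self.ffn_text = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        for p in self.ffn_text.parameters():
            p.requires_grad = False
        # Parallel speech FFN with the same shape, trained on speech tokens.
        self.ffn_speech = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor, is_speech: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); is_speech: (batch, seq) boolean mask.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        h = self.norm2(x)
        # Route each token through the FFN matching its modality. For clarity
        # both paths are evaluated on all tokens; an efficient version would
        # gather tokens per modality instead.
        mask = is_speech.unsqueeze(-1)
        return x + torch.where(mask, self.ffn_speech(h), self.ffn_text(h))

# Usage: first half of the sequence is text, second half is speech tokens.
block = ModalitySplitBlock(d_model=512, n_heads=8, d_ff=2048)
x = torch.randn(2, 16, 512)
is_speech = torch.zeros(2, 16, dtype=torch.bool)
is_speech[:, 8:] = True
y = block(x, is_speech)  # -> shape (2, 16, 512)
```

The per-token torch.where routing is chosen here only to keep the sketch short; the key property it illustrates is that gradients never reach the frozen text path, matching the abstract's claim of preserving the pretrained LLM's reasoning and knowledge.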