

MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance

October 1, 2025
作者: Xingjian Zhao, Zhe Xu, Qinyuan Cheng, Zhaoye Fei, Luozhijie Jin, Yang Wang, Hanfu Chen, Yaozhou Jiang, Qinghui Gao, Ke Chen, Ruixiao Li, Mingshu Chen, Ruiming Wang, Wenbo Zhang, Yiyang Zhang, Donghua Yu, Yang Gao, Xiaogui Yang, Yitian Gong, Yuanfan Xu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu
cs.AI

Abstract

Spoken dialogue systems often rely on cascaded pipelines that transcribe, process, and resynthesize speech. While effective, this design discards paralinguistic cues and limits expressivity. Recent end-to-end methods reduce latency and better preserve these cues, yet still rely on text intermediates, creating a fundamental bottleneck. We present MOSS-Speech, a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our approach combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning and knowledge of pretrained text LLMs while adding native speech capabilities. Experiments show that our model achieves state-of-the-art results in spoken question answering and delivers speech-to-speech performance comparable to existing text-guided systems, while still maintaining competitive text performance. By narrowing the gap between text-guided and direct speech generation, our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
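The abstract's "modality-based layer-splitting" with a frozen text backbone can be illustrated with a minimal sketch. This is not the authors' code; the class names, the stand-in scalar "weights", and the routing rule are all illustrative assumptions. The idea shown is only that each layer holds separate parameters per modality, tokens are routed by modality, and the text-side parameters stay frozen while the speech side is trainable.

```python
# Minimal sketch (illustrative assumption, not the MOSS-Speech implementation)
# of modality-based layer splitting: each layer keeps frozen text parameters
# from a pretrained text LLM and separate trainable speech parameters, and
# routes each token through the branch matching its modality.

from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    value: float      # stand-in for a hidden-state vector
    modality: str     # "text" or "speech"


class SplitLayer:
    """One layer with per-modality parameters.

    text_scale stands in for frozen pretrained text-LLM weights;
    speech_scale stands in for the newly added, trainable speech weights.
    """

    def __init__(self, text_scale: float, speech_scale: float):
        self.text_scale = text_scale
        self.speech_scale = speech_scale

    def forward(self, tokens: List[Token]) -> List[Token]:
        out = []
        for t in tokens:
            # Route by modality: text tokens use the frozen branch,
            # speech tokens use the speech branch.
            scale = self.text_scale if t.modality == "text" else self.speech_scale
            out.append(Token(t.value * scale, t.modality))
        return out


# Two stacked layers; the frozen text branch is an identity here (scale 1.0),
# while the speech branch applies its own learned transformation.
layers = [SplitLayer(1.0, 2.0), SplitLayer(1.0, 2.0)]
tokens = [Token(3.0, "text"), Token(3.0, "speech")]
for layer in layers:
    tokens = layer.forward(tokens)
print([t.value for t in tokens])  # → [3.0, 12.0]
```

In a real model each branch would be a full attention/MLP block rather than a scalar, but the routing and freeze/train split follow the same pattern.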