

MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance

October 1, 2025
作者: Xingjian Zhao, Zhe Xu, Qinyuan Cheng, Zhaoye Fei, Luozhijie Jin, Yang Wang, Hanfu Chen, Yaozhou Jiang, Qinghui Gao, Ke Chen, Ruixiao Li, Mingshu Chen, Ruiming Wang, Wenbo Zhang, Yiyang Zhang, Donghua Yu, Yang Gao, Xiaogui Yang, Yitian Gong, Yuanfan Xu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu
cs.AI

Abstract

Spoken dialogue systems often rely on cascaded pipelines that transcribe, process, and resynthesize speech. While effective, this design discards paralinguistic cues and limits expressivity. Recent end-to-end methods reduce latency and better preserve these cues, yet still rely on text intermediates, creating a fundamental bottleneck. We present MOSS-Speech, a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our approach combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning and knowledge of pretrained text LLMs while adding native speech capabilities. Experiments show that our model achieves state-of-the-art results in spoken question answering and delivers speech-to-speech performance comparable to existing text-guided systems, while still maintaining competitive text performance. By narrowing the gap between text-guided and direct speech generation, our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
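The abstract's "modality-based layer-splitting" with a frozen text backbone can be illustrated with a minimal sketch. This is not the authors' code; the class names, the stand-in scalar "weights", and the routing rule are all illustrative assumptions. The idea shown is only that each layer holds separate parameters per modality, tokens are routed by modality, and the text-side parameters stay frozen while the speech side is trainable.

```python
# Minimal sketch (illustrative assumption, not the MOSS-Speech implementation)
# of modality-based layer splitting: each layer keeps frozen text parameters
# from a pretrained text LLM and separate trainable speech parameters, and
# routes each token through the branch matching its modality.

from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    value: float      # stand-in for a hidden-state vector
    modality: str     # "text" or "speech"


class SplitLayer:
    """One layer with per-modality parameters.

    text_scale stands in for frozen pretrained text-LLM weights;
    speech_scale stands in for the newly added, trainable speech weights.
    """

    def __init__(self, text_scale: float, speech_scale: float):
        self.text_scale = text_scale
        self.speech_scale = speech_scale

    def forward(self, tokens: List[Token]) -> List[Token]:
        out = []
        for t in tokens:
            # Route by modality: text tokens use the frozen branch,
            # speech tokens use the speech branch.
            scale = self.text_scale if t.modality == "text" else self.speech_scale
            out.append(Token(t.value * scale, t.modality))
        return out


# Two stacked layers; the frozen text branch is an identity here (scale 1.0),
# while the speech branch applies its own learned transformation.
layers = [SplitLayer(1.0, 2.0), SplitLayer(1.0, 2.0)]
tokens = [Token(3.0, "text"), Token(3.0, "speech")]
for layer in layers:
    tokens = layer.forward(tokens)
print([t.value for t in tokens])  # → [3.0, 12.0]
```

In a real model each branch would be a full attention/MLP block rather than a scalar, but the routing and freeze/train split follow the same pattern.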