MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance
October 1, 2025
Authors: Xingjian Zhao, Zhe Xu, Qinyuan Cheng, Zhaoye Fei, Luozhijie Jin, Yang Wang, Hanfu Chen, Yaozhou Jiang, Qinghui Gao, Ke Chen, Ruixiao Li, Mingshu Chen, Ruiming Wang, Wenbo Zhang, Yiyang Zhang, Donghua Yu, Yang Gao, Xiaogui Yang, Yitian Gong, Yuanfan Xu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu
cs.AI
Abstract
Spoken dialogue systems often rely on cascaded pipelines that transcribe,
process, and resynthesize speech. While effective, this design discards
paralinguistic cues and limits expressivity. Recent end-to-end methods reduce
latency and better preserve these cues, yet still rely on text intermediates,
creating a fundamental bottleneck. We present MOSS-Speech, a true
speech-to-speech large language model that directly understands and generates
speech without relying on text guidance. Our approach combines a modality-based
layer-splitting architecture with a frozen pre-training strategy, preserving
the reasoning and knowledge of pretrained text LLMs while adding native speech
capabilities. Experiments show that our model achieves state-of-the-art
results in spoken question answering and delivers speech-to-speech
performance comparable to that of existing text-guided systems, while still
maintaining competitive text performance. By narrowing the gap between text-guided and
direct speech generation, our work establishes a new paradigm for expressive
and efficient end-to-end speech interaction.
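
The abstract names two mechanisms: a modality-based layer-splitting architecture and a frozen pre-training strategy. Below is a minimal PyTorch sketch of one way these could fit together; it is an illustrative assumption, not the paper's released implementation, and every name in it (ModalitySplitBlock, ffn_text, ffn_speech, the is_speech mask) is hypothetical. The idea shown: text tokens flow through the frozen feed-forward weights of the pretrained text LLM, while speech tokens are routed through a parallel, trainable feed-forward path, so text knowledge is preserved while native speech capability is added.

```python
import torch
import torch.nn as nn

class ModalitySplitBlock(nn.Module):
    """Hypothetical transformer block with a modality-split feed-forward path:
    text tokens reuse the frozen pretrained FFN, speech tokens use a newly
    initialized, trainable FFN. Attention is shared across modalities."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Pretrained text FFN, frozen to preserve the text LLM's knowledge.
        self.ffn_text = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        for p in self.ffn_text.parameters():
            p.requires_grad = False
        # Parallel speech FFN with the same shape, trained on speech tokens.
        self.ffn_speech = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor, is_speech: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); is_speech: (batch, seq) boolean mask.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        h = self.norm2(x)
        # Route each token through the FFN matching its modality. For clarity
        # both paths are evaluated on all tokens; an efficient version would
        # gather tokens per modality instead.
        mask = is_speech.unsqueeze(-1)
        return x + torch.where(mask, self.ffn_speech(h), self.ffn_text(h))

# Usage: first half of the sequence is text, second half is speech tokens.
block = ModalitySplitBlock(d_model=512, n_heads=8, d_ff=2048)
x = torch.randn(2, 16, 512)
is_speech = torch.zeros(2, 16, dtype=torch.bool)
is_speech[:, 8:] = True
y = block(x, is_speech)  # -> shape (2, 16, 512)
```

The per-token torch.where routing is chosen here only to keep the sketch short; the key property it illustrates is that gradients never reach the frozen text path, matching the abstract's claim of preserving the pretrained LLM's reasoning and knowledge.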