MOSS-TTS Technical Report

March 18, 2026
Authors: Yitian Gong, Botian Jiang, Yiwei Zhao, Yucheng Yuan, Kuangwei Chen, Yaozhou Jiang, Cheng Chang, Dong Hong, Mingshu Chen, Ruixiao Li, Yiyang Zhang, Yang Gao, Hanfu Chen, Ke Chen, Songlin Wang, Xiaogui Yang, Yuqian Zhang, Kexin Huang, ZhengYuan Lin, Kang Yu, Ziqi Chen, Jin Wang, Zhaoye Fei, Qinyuan Cheng, Shimin Li, Xipeng Qiu
cs.AI

Abstract

This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. Built on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ and unified semantic-acoustic representations, we release two complementary generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context/control-oriented deployment, and MOSS-TTS-Local-Transformer, which introduces a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.