MOSS-TTS Technical Report

March 18, 2026
Authors: Yitian Gong, Botian Jiang, Yiwei Zhao, Yucheng Yuan, Kuangwei Chen, Yaozhou Jiang, Cheng Chang, Dong Hong, Mingshu Chen, Ruixiao Li, Yiyang Zhang, Yang Gao, Hanfu Chen, Ke Chen, Songlin Wang, Xiaogui Yang, Yuqian Zhang, Kexin Huang, ZhengYuan Lin, Kang Yu, Ziqi Chen, Jin Wang, Zhaoye Fei, Qinyuan Cheng, Shimin Li, Xipeng Qiu
cs.AI

Abstract

This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. Built on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ and unified semantic-acoustic representations, we release two complementary generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context/control-oriented deployment, and MOSS-TTS-Local-Transformer, which introduces a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.