MOSS-TTS Technical Report

Abstract

This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. Building on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ and unified semantic-acoustic representations, we release two complementary generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context/control-oriented deployment, and MOSS-TTS-Local-Transformer, which introduces a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.
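To make the tokenizer figures concrete, the following is a minimal sketch of the arithmetic implied by compressing 24 kHz audio to 12.5 frames per second. Only the sample rate and frame rate come from the abstract; the function names, the `num_rvq_layers` parameter, and the choice to round partial frames up are illustrative assumptions, since the abstract does not state how many RVQ layers are used at a given bitrate or how trailing audio is padded.

```python
import math

SAMPLE_RATE_HZ = 24_000   # stated in the abstract
FRAME_RATE_FPS = 12.5     # stated in the abstract


def samples_per_frame(sample_rate_hz: float = SAMPLE_RATE_HZ,
                      frame_rate_fps: float = FRAME_RATE_FPS) -> float:
    """Number of raw audio samples summarized by one token frame."""
    return sample_rate_hz / frame_rate_fps


def frames_for_duration(seconds: float,
                        frame_rate_fps: float = FRAME_RATE_FPS) -> int:
    """Token frames needed to cover `seconds` of audio.

    Rounding a partial frame up is an assumption made for illustration.
    """
    return math.ceil(seconds * frame_rate_fps)


def tokens_for_duration(seconds: float,
                        num_rvq_layers: int,
                        frame_rate_fps: float = FRAME_RATE_FPS) -> int:
    """Total discrete tokens if each frame carries `num_rvq_layers` codes.

    `num_rvq_layers` is a hypothetical parameter; with variable-bitrate RVQ
    the number of active layers can differ between configurations.
    """
    return frames_for_duration(seconds, frame_rate_fps) * num_rvq_layers


if __name__ == "__main__":
    print(samples_per_frame())           # samples covered by one frame
    print(1.0 / FRAME_RATE_FPS)          # frame duration in seconds
    print(frames_for_duration(60.0))     # frames for one minute of audio
```

At these rates each frame spans 1920 samples, or 80 ms of audio, so one minute of speech corresponds to 750 frames. This low frame rate is what keeps sequence lengths manageable for long-form autoregressive generation.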
