MOSS-TTS Technical Report

Abstract

This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. Its foundation is MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 frames per second (fps) using variable-bitrate residual vector quantization (RVQ) and unified semantic-acoustic representations. On top of this tokenizer, we release two complementary generators. MOSS-TTS emphasizes structural simplicity, scalability, and long-context, control-oriented deployment. MOSS-TTS-Local-Transformer introduces a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme- and pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.
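
As a rough illustration of the figures quoted in the abstract, the sketch below works through the frame-rate arithmetic implied by 24 kHz audio tokenized at 12.5 fps, and shows how a target duration maps to a token-frame count, which is the quantity that token-level duration control would operate on. The function names are hypothetical, and the RVQ layer count and codebook size in the bitrate example are assumptions chosen for illustration only; the abstract does not state them.

```python
import math

SAMPLE_RATE_HZ = 24_000  # stated in the abstract
FRAME_RATE_FPS = 12.5    # stated in the abstract


def samples_per_frame(sample_rate_hz: float = SAMPLE_RATE_HZ,
                      frame_rate_fps: float = FRAME_RATE_FPS) -> float:
    """Number of waveform samples summarized by one token frame."""
    return sample_rate_hz / frame_rate_fps


def frames_for_duration(seconds: float,
                        frame_rate_fps: float = FRAME_RATE_FPS) -> int:
    """Token frames needed to cover `seconds` of audio (rounded up)."""
    return math.ceil(seconds * frame_rate_fps)


def rvq_bitrate_bps(num_layers: int, codebook_size: int,
                    frame_rate_fps: float = FRAME_RATE_FPS) -> float:
    """Bitrate of an RVQ stream: frames/s * layers * bits per code.

    `num_layers` and `codebook_size` are hypothetical inputs here;
    the abstract does not give the real values.
    """
    return frame_rate_fps * num_layers * math.log2(codebook_size)


if __name__ == "__main__":
    print(samples_per_frame())        # 1920.0 samples per frame
    print(frames_for_duration(10.0))  # 125 frames for 10 s of audio
    # Assumed 8 layers of 1024-entry codebooks (illustrative only).
    print(rvq_bitrate_bps(8, 1024))   # 1000.0 bits per second
    # Keeping fewer RVQ layers lowers the bitrate.
    print(rvq_bitrate_bps(4, 1024))   # 500.0 bits per second
```

Under these assumptions, each token frame covers 1920 samples (80 ms), so a requested duration converts to a frame budget by a single multiplication. Varying the number of RVQ layers is one way a variable-bitrate tokenizer can trade quality for bandwidth.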

