MOSS-TTS Technical Report

Abstract

This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. The system builds on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 frames per second (fps) using variable-bitrate residual vector quantization (RVQ) and unified semantic-acoustic representations. On top of this tokenizer, we release two complementary generators. MOSS-TTS emphasizes structural simplicity, scalability, and long-context, control-oriented deployment. MOSS-TTS-Local-Transformer introduces a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.
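
To make the tokenizer figures concrete, the following is a minimal sketch of the arithmetic implied by compressing 24 kHz audio to 12.5 fps. Only the sample rate and frame rate come from the abstract. The function names are hypothetical, and the codebook count and codebook size used in the bitrate example are illustrative assumptions, not values stated in this report.

```python
import math

SAMPLE_RATE_HZ = 24_000  # from the abstract
FRAME_RATE_HZ = 12.5     # from the abstract


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_rate_hz=FRAME_RATE_HZ):
    """Number of waveform samples summarized by one token frame."""
    return sample_rate_hz / frame_rate_hz


def frames_for_duration(seconds, frame_rate_hz=FRAME_RATE_HZ):
    """Number of token frames needed to cover a clip of the given length."""
    return math.ceil(seconds * frame_rate_hz)


def rvq_bitrate_bps(num_codebooks, codebook_size, frame_rate_hz=FRAME_RATE_HZ):
    """Bitrate of an RVQ stream: each frame emits one index per codebook,
    and each index carries log2(codebook_size) bits.

    num_codebooks and codebook_size are hypothetical inputs here; the
    abstract does not state the values used by MOSS-Audio-Tokenizer.
    """
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


if __name__ == "__main__":
    print(samples_per_frame())        # samples covered by one frame
    print(frames_for_duration(60))    # frames in one minute of audio
    # Variable bitrate: using fewer or more RVQ layers scales the bitrate linearly.
    for n in (8, 16):                 # assumed layer counts, for illustration only
        print(n, rvq_bitrate_bps(n, codebook_size=1024))
```

Under these assumptions, one frame spans 1920 samples (80 ms), so a minute of speech occupies 750 frames. This short sequence length is what makes long-form autoregressive generation tractable, and dropping or adding RVQ layers trades bitrate against fidelity without changing the frame rate.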
