dots.tts Technical Report

Abstract

We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. It differs from existing continuous autoregressive models in three ways. First, we train an AudioVAE with multiple objectives to build a continuous speech space that is semantically structured and easy to predict. Second, we use full-history conditioning in the flow-matching head to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. Trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with word error rates (WER) of 0.94%/1.30%/6.60% and similarity (SIM) scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. On other benchmarks, dots.tts also consistently achieves state-of-the-art results among open-source models, showing strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply CFG-aware MeanFlow distillation, which yields first-packet latencies of 85 ms and 54 ms in output-streaming and dual-streaming modes, respectively. To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.
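
For readers unfamiliar with the WER figures quoted above, the following is a minimal sketch of how an edit-distance error rate is commonly computed between a reference text and a transcript of the synthesized audio. It is an illustration only, not the Seed-TTS-Eval scoring pipeline: the function names `edit_distance` and `word_error_rate`, the whitespace tokenization, and the `by_char` switch (character-level scoring, as is typical for Chinese text) are all assumptions, and the ASR transcription and text normalization steps are omitted.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str, by_char: bool = False) -> float:
    """Edit distance divided by reference length (hypothetical helper).

    by_char=True scores individual characters instead of
    whitespace-separated words (an assumption for unsegmented text).
    """
    if by_char:
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        ref = reference.split()
        hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the denominator is the reference length, insertions can push the rate above 100%, so a corpus-level figure is usually obtained by summing edit distances and reference lengths over all utterances rather than averaging per-utterance rates.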