dots.tts Technical Report

Abstract

We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that generates speech in a continuous latent space. Compared with existing continuous autoregressive models, dots.tts makes three key contributions. First, we train an AudioVAE with multiple objectives to build a continuous speech space that is semantically structured and easy to predict. Second, we condition the flow-matching head on the full generation history to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. Trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with word error rates (WERs) of 0.94%/1.30%/6.60% and similarity (SIM) scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. On other benchmarks, dots.tts consistently achieves state-of-the-art performance among open-source models, with strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply classifier-free guidance (CFG)-aware MeanFlow distillation, which yields first-packet latencies of 85 ms and 54 ms in output-streaming and dual-streaming modes, respectively. To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.
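
For readers unfamiliar with the WER metric quoted above, the following is a minimal sketch of the standard definition: the edit distance between reference and hypothesis token sequences divided by the reference length. It is not the evaluation code used for dots.tts or Seed-TTS-Eval. The function names, the whitespace tokenization for English, the per-character tokenization for Chinese, and the corpus-level aggregation are all assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref_tokens, hyp_tokens):
    """Error rate for one utterance: edits / number of reference tokens."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def corpus_wer(pairs):
    """Corpus-level rate: total edits / total reference tokens.

    Assumption: errors are pooled over the whole test set rather than
    averaged per utterance.
    """
    total_errors = sum(edit_distance(r, h) for r, h in pairs)
    total_ref = sum(len(r) for r, _ in pairs)
    return total_errors / total_ref


# English example: whitespace tokens (assumed tokenization).
en = wer("the cat sat on the mat".split(), "the cat sit on mat".split())
# Chinese example: one token per character (assumed tokenization).
zh = wer(list("你好世界"), list("你好世界"))
```

In practice the hypothesis text comes from running a speech recognizer on the synthesized audio, so the reported number also reflects recognizer errors and text normalization choices.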