dots.tts Technical Report

June 5, 2026
Authors: Shi Lian, Changtao Li, Bohan Li, Hankun Wang, Da Zheng, Junfeng Tian, Yufeng Ma, Colin Zhang, Kai Yu
cs.AI

Abstract

We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. Compared with existing continuous autoregressive models, our key innovations are threefold. First, we train an AudioVAE with multiple objectives to build a semantically structured and prediction-friendly continuous speech space. Second, we use full-history conditioning in the flow-matching head to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. After being trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94%/1.30%/6.60% and SIM scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. Across other benchmarks, dots.tts also consistently demonstrates open-source state-of-the-art performance, exhibiting strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply CFG-aware MeanFlow distillation, enabling low-latency speech generation with first-packet latencies of 85/54 ms in output streaming and dual-streaming modes, respectively. To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.
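
For readers unfamiliar with the WER figures quoted above, the following is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance (substitutions, insertions, deletions) between a reference transcript and a hypothesis, divided by the number of reference words. It is an illustration only, not the evaluation code used for the reported numbers. The function name `word_error_rate` and the whitespace tokenization are assumptions made here; real TTS evaluation pipelines typically transcribe the synthesized audio with an ASR system and normalize the text first, and character-level units are commonly used for Chinese.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    rate = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER = {rate:.2%}")
```

Under this definition, a WER of 1.30% corresponds to roughly 13 word errors per 1,000 reference words.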