dots.tts Technical Report

June 5, 2026
Authors: Shi Lian, Changtao Li, Bohan Li, Hankun Wang, Da Zheng, Junfeng Tian, Yufeng Ma, Colin Zhang, Kai Yu
cs.AI

Abstract

We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. Compared with existing continuous autoregressive models, our key innovations are threefold. First, we train an AudioVAE with multiple objectives to build a semantically structured and prediction-friendly continuous speech space. Second, we use full-history conditioning in the flow-matching head to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. After being trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94%/1.30%/6.60% and SIM scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. Across other benchmarks, dots.tts also consistently demonstrates open-source state-of-the-art performance, exhibiting strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply CFG-aware MeanFlow distillation, enabling low-latency speech generation with first-packet latencies of 85/54 ms in output streaming and dual-streaming modes, respectively. To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.
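
For readers unfamiliar with the WER figures quoted above, the following is a minimal sketch of the standard word error rate definition: the token-level edit distance between a reference transcript and a recognized transcript, divided by the reference length. It is not the Seed-TTS-Eval pipeline; the ASR model and text normalization used by that benchmark are not described here, and the function name `word_error_rate` and the example sentences are hypothetical. Passing lists of characters instead of words gives a character-level rate, which is one common way such a metric is applied to Chinese text.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / len(reference).

    Both arguments are sequences of tokens (words, or characters for a
    character-level rate). Illustrative only; not the benchmark's scorer.
    """
    ref = list(reference)
    hyp = list(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j tokens of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_token in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_token == hyp_token else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of ref_token
                curr[j - 1] + 1,                  # insertion of hyp_token
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("sat" -> "sit") and one deletion ("the").
ref_words = "the cat sat on the mat".split()
hyp_words = "the cat sit on mat".split()
wer = word_error_rate(ref_words, hyp_words)
print(f"WER = {wer:.2%}")
```

Because the numerator counts insertions as well as substitutions and deletions, the ratio can exceed 100% when the hypothesis is much longer than the reference.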