dots.tts Technical Report

Abstract

We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. Compared with existing continuous autoregressive models, our key innovations are threefold. First, we train an AudioVAE with multiple objectives to build a semantically structured and prediction-friendly continuous speech space. Second, we use full-history conditioning in the flow-matching head to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. After being trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94%/1.30%/6.60% and SIM scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. Across other benchmarks, dots.tts also consistently demonstrates open-source state-of-the-art performance, exhibiting strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply CFG-aware MeanFlow distillation, enabling low-latency speech generation with first-packet latencies of 85/54 ms in output streaming and dual-streaming modes, respectively. To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.
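For readers unfamiliar with the WER figures quoted above, the following is a minimal sketch of the standard word error rate: the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. It is illustrative only and is not the Seed-TTS-Eval pipeline; the function name `word_error_rate` and whitespace tokenization are assumptions made here, and benchmark scoring typically adds ASR transcription and text normalization steps (and may score Chinese at the character level) that are omitted.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Assumption: words are obtained by whitespace splitting, with no text
    normalization (casing, punctuation) applied beforehand.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words (one row of the DP table).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a reported WER of 0.94% corresponds to roughly one word error per 106 reference words. Note that WER can exceed 100% when the hypothesis contains many inserted words.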