VoXtream2: A Full-Stream Text-to-Speech System with Dynamic Speaking-Rate Control

Abstract

Full-stream text-to-speech (TTS) for interactive systems must start speaking with minimal delay while remaining controllable as text arrives incrementally. We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control: the target rate can be changed mid-utterance, while speech is being generated. VoXtream2 combines a distribution-matching mechanism over duration states with classifier-free guidance across conditioning signals to improve controllability and synthesis quality. Prompt-text masking enables textless audio prompting, removing the need to transcribe the prompt. On standard zero-shot benchmarks and a dedicated speaking-rate test set, VoXtream2 achieves objective and subjective results competitive with public baselines despite a smaller model and less training data. In full-stream mode, it runs 4 times faster than real time with a first-packet latency of 74 ms on a consumer GPU.
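
The abstract names classifier-free guidance but does not give its formulation. As a rough illustration only, the sketch below shows the commonly used form of the technique, in which the logits predicted with and without a conditioning signal are extrapolated by a guidance scale. The function names, the toy logit values, and the scale of 1.5 are assumptions made for this example; how VoXtream2 applies guidance across its several conditioning signals is not described here.

```python
import math


def cfg_logits(cond, uncond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond).

    scale = 0 returns the unconditional logits, scale = 1 the conditional
    ones, and scale > 1 pushes further in the conditional direction.
    """
    if len(cond) != len(uncond):
        raise ValueError("logit vectors must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over three hypothetical output tokens (illustrative values only).
cond = [2.0, 0.5, -1.0]    # prediction with the conditioning signal present
uncond = [1.0, 1.0, 0.0]   # prediction with the conditioning signal dropped

guided = cfg_logits(cond, uncond, scale=1.5)
probs = softmax(guided)
print(guided)
print(probs)
```

With a scale above 1, tokens that the conditioning signal favors gain probability relative to the purely conditional prediction, which is the usual motivation for applying guidance to improve controllability.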