VoXtream2: Full-stream TTS with dynamic speaking rate control

Abstract

Full-stream text-to-speech (TTS) for interactive systems must start speaking with minimal delay while remaining controllable as text arrives incrementally. We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control that can be updated mid-utterance on the fly. VoXtream2 combines a distribution matching mechanism over duration states with classifier-free guidance across conditioning signals to improve controllability and synthesis quality. Prompt-text masking enables textless audio prompting, removing the need for prompt transcription. Across standard zero-shot benchmarks and a dedicated speaking-rate test set, VoXtream2 achieves competitive objective and subjective results against public baselines despite a smaller model and less training data. In full-stream mode, it runs 4 times faster than real time with 74 ms first-packet latency on a consumer GPU.
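For readers unfamiliar with classifier-free guidance, the following is a minimal sketch of its generic form applied to next-token logits. It is not taken from VoXtream2: the function name `cfg_logits`, the `scale` parameter, and the choice to guide in logit space are illustrative assumptions, and the model's actual guidance across several conditioning signals may differ.

```python
from typing import List


def cfg_logits(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Generic classifier-free guidance (hypothetical helper, not the paper's code).

    Extrapolates from the unconditional prediction towards the conditional one:
        guided = uncond + scale * (cond - uncond)
    scale = 0 returns the unconditional logits, scale = 1 the conditional
    logits, and scale > 1 pushes further in the direction of the condition.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # Toy logits over a three-token vocabulary (made-up numbers).
    conditional = [2.0, 0.5, -1.0]
    unconditional = [1.0, 1.0, 0.0]
    print(cfg_logits(conditional, unconditional, scale=1.5))
```

In a setup of this kind, the unconditional branch is typically obtained by dropping or masking the conditioning signal during training, so one network can produce both predictions at inference time.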
