VoXtream2: Full-stream TTS with dynamic speaking rate control

Abstract

Full-stream text-to-speech (TTS) for interactive systems must start speaking with minimal delay while remaining controllable as text arrives incrementally. We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control that can be updated mid-utterance. VoXtream2 combines a distribution matching mechanism over duration states with classifier-free guidance across conditioning signals to improve controllability and synthesis quality. Prompt-text masking enables textless audio prompting, removing the need for prompt transcription. Across standard zero-shot benchmarks and a dedicated speaking-rate test set, VoXtream2 achieves competitive objective and subjective results against public baselines despite a smaller model and less training data. In full-stream mode, it runs 4 times faster than real time with 74 ms first-packet latency on a consumer GPU.
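The abstract does not spell out how guidance is applied across the conditioning signals. As a minimal sketch only, assuming the standard single-condition form of classifier-free guidance (the function name `cfg_combine`, the guidance scale, and the logit values below are hypothetical and not taken from VoXtream2), the conditional and unconditional predictions can be blended like this:

```python
def cfg_combine(cond_logits, uncond_logits, guidance_scale):
    """Generic classifier-free guidance step (not the paper's exact recipe).

    Moves the unconditional logits toward the conditional ones;
    a scale above 1 extrapolates past the conditional prediction.
    """
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit vectors must have the same length")
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(cond_logits, uncond_logits)
    ]


# Hypothetical logits over three candidate tokens.
cond = [2.0, 0.5, -1.0]    # prediction with the conditioning signal present
uncond = [1.0, 0.5, 0.0]   # prediction with the conditioning signal dropped
guided = cfg_combine(cond, uncond, guidance_scale=1.5)
```

With a scale of 1 the result equals the conditional logits, and with a scale of 0 it equals the unconditional ones, so the scale acts as a dial on how strongly the condition is enforced.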
