VoXtream2: Full-stream TTS with dynamic speaking rate control

Abstract

Full-stream text-to-speech (TTS) for interactive systems must start speaking with minimal delay while remaining controllable as text arrives incrementally. We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control that can be updated mid-utterance. VoXtream2 combines a distribution matching mechanism over duration states with classifier-free guidance across conditioning signals to improve controllability and synthesis quality. Prompt-text masking enables textless audio prompting, removing the need for prompt transcription. Across standard zero-shot benchmarks and a dedicated speaking-rate test set, VoXtream2 achieves competitive objective and subjective results against public baselines despite a smaller model and less training data. In full-stream mode, it runs 4 times faster than real time with a first-packet latency of 74 ms on a consumer GPU.
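
To make the throughput figures concrete, here is a minimal sketch of the arithmetic. It assumes the common definition of real-time factor (processing time divided by audio duration). The function names and the 10-second utterance are illustrative assumptions, not taken from the paper.

```python
def real_time_factor(speedup: float) -> float:
    """RTF under the assumed definition: processing time / audio duration."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def synthesis_time_s(audio_duration_s: float, speedup: float) -> float:
    """Seconds of compute needed to generate audio_duration_s seconds of speech."""
    return audio_duration_s * real_time_factor(speedup)


# Figures quoted in the abstract.
SPEEDUP = 4.0            # "4 times faster than real time"
FIRST_PACKET_MS = 74.0   # first-packet latency

rtf = real_time_factor(SPEEDUP)
# Hypothetical 10 s utterance, chosen only for illustration.
compute_s = synthesis_time_s(10.0, SPEEDUP)

print(f"RTF = {rtf:.2f}, compute for 10 s of audio = {compute_s:.1f} s")
print(f"first audio after about {FIRST_PACKET_MS:.0f} ms")
```

Under these assumptions, and if throughput stays steady, generation outpaces playback, so the delay a listener perceives is governed by the first-packet latency rather than by the total synthesis time.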
