Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space

Abstract

We introduce SLED, an alternative approach to speech language modeling that encodes speech waveforms into sequences of continuous latent representations and models them autoregressively using an energy distance objective. The energy distance offers an analytical measure of the distributional gap by contrasting simulated and target samples, enabling efficient training to capture the underlying continuous autoregressive distribution. By bypassing reliance on residual vector quantization, SLED avoids discretization errors and eliminates the need for the complicated hierarchical architectures common in existing speech language models. This simplifies the overall modeling pipeline while preserving the richness of speech information and maintaining inference efficiency. Empirical results show that SLED achieves strong performance in both zero-shot and streaming speech synthesis, indicating its potential for broader applications in general-purpose speech language models.
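
To make the idea of contrasting simulated and target samples concrete, the sketch below shows one common sample-based estimate of the energy distance. The abstract does not give the formula, so the form used here is an assumption: two independent model samples x1 and x2 are drawn for the same step and compared with the target latent y as d(x1, y) + d(x2, y) - d(x1, x2), where d is the Euclidean distance. The target-to-target term of the full energy distance is omitted because it does not depend on the model. The names `energy_distance_loss` and `toy_sampler` are hypothetical, and `toy_sampler` is only a stand-in for a generative head, not the architecture used by SLED.

```python
import math
import random


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def energy_distance_loss(sample_1, sample_2, target):
    """Energy-distance estimate from two model samples and one target.

    Assumed form (not taken from the source):
        d(x1, y) + d(x2, y) - d(x1, x2)
    The first two terms pull samples toward the target; the last term
    rewards diversity between the two samples.
    """
    attract = l2_distance(sample_1, target) + l2_distance(sample_2, target)
    repel = l2_distance(sample_1, sample_2)
    return attract - repel


def toy_sampler(mean, noise_scale, rng):
    """Hypothetical stand-in for a generative head: mean plus Gaussian noise."""
    return [m + noise_scale * rng.gauss(0.0, 1.0) for m in mean]


rng = random.Random(0)
target = [0.5, -1.0, 2.0]          # one continuous latent frame (toy values)
x1 = toy_sampler(target, 0.1, rng)  # first simulated sample
x2 = toy_sampler(target, 0.1, rng)  # second, independent simulated sample
loss = energy_distance_loss(x1, x2, target)
print(loss)
```

For any single triple the estimate is non-negative by the triangle inequality, and it is zero when both samples coincide with the target. In a real training setup the same expression would be computed on tensors with an automatic differentiation library and averaged over time steps and batch elements.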
