BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs

Abstract

The rise of Large Language Models (LLMs) is reshaping multimodal models, with speech synthesis being a prominent application. However, existing approaches often underutilize the linguistic intelligence of these models and typically fail to leverage their powerful instruction-following capabilities. This limitation hinders a model's ability to follow text instructions for controllable Text-to-Speech (TTS). To address this, we propose a new paradigm, inspired by "operationalism", that decouples instruction understanding from speech generation. We introduce BatonVoice, a framework in which an LLM acts as a "conductor": it interprets user instructions and generates a textual "plan" of explicit vocal features (e.g., pitch, energy). A separate TTS model, the "orchestra", then generates speech from these features. To realize the orchestra, we develop BatonTTS, a TTS model trained specifically for this task. Our experiments demonstrate that BatonVoice achieves strong performance in controllable and emotional speech synthesis, outperforming strong open- and closed-source baselines. Notably, our approach enables zero-shot cross-lingual generalization, accurately applying its feature-control abilities to languages unseen during post-training. This demonstrates that objectifying speech into textual vocal features can more effectively unlock the linguistic intelligence of LLMs.
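
The conductor/orchestra split described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the BatonVoice implementation: the JSON plan format, the `SegmentFeatures` schema, the numeric feature values, and the rule-based `conductor` (which stands in for an LLM) are all hypothetical. Only the feature names pitch and energy come from the abstract. The sketch shows one property of the design: the orchestra sees the textual plan and never the original instruction.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class SegmentFeatures:
    """Vocal features for one text segment (hypothetical schema)."""
    text: str
    pitch_hz: float  # mean pitch; pitch is named in the abstract
    energy: float    # relative loudness in [0, 1]; energy is named in the abstract


def conductor(instruction: str, text: str) -> str:
    """Stand-in for the LLM: map an instruction and text to a textual plan.

    A real system would prompt an LLM here. This toy rule only illustrates
    the interface: instruction and text go in, plan text comes out.
    """
    excited = "excited" in instruction.lower()
    pitch, energy = (220.0, 0.9) if excited else (140.0, 0.5)
    segments = [
        SegmentFeatures(sentence.strip(), pitch, energy)
        for sentence in text.split(".")
        if sentence.strip()
    ]
    return json.dumps([asdict(segment) for segment in segments])


def orchestra(plan_text: str) -> List[SegmentFeatures]:
    """Stand-in for the TTS model: it consumes only the plan text.

    A feature-conditioned TTS model would render audio at this point;
    the sketch just parses the plan back into structured features.
    """
    return [SegmentFeatures(**item) for item in json.loads(plan_text)]


plan = conductor("Sound excited", "We won. Let us celebrate.")
segments = orchestra(plan)
```

Because the plan is plain text, the conductor can be swapped for any model that emits the same format, and the orchestra can be trained or tested on plans alone.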