BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs
September 30, 2025
Authors: Yue Wang, Ruotian Ma, Xingyu Chen, Zhengliang Shi, Wanshun Chen, Huang Liu, Jiadi Yao, Qu Yang, Qingxuan Jiang, Fanghua Ye, Juntao Li, Min Zhang, Zhaopeng Tu, Xiaolong Li, Linus
cs.AI
Abstract
The rise of Large Language Models (LLMs) is reshaping multimodal models, with
speech synthesis being a prominent application. However, existing approaches
often underutilize the linguistic intelligence of these models, typically
failing to leverage their powerful instruction-following capabilities. This
limitation hinders the model's ability to follow text instructions for
controllable Text-to-Speech (TTS). To address this, we propose a new paradigm
inspired by "operationalism" that decouples instruction understanding from
speech generation. We introduce BatonVoice, a framework where an LLM acts as a
"conductor", understanding user instructions and generating a textual
"plan": explicit vocal features (e.g., pitch, energy). A separate TTS
model, the "orchestra", then generates the speech from these features. To
realize this component, we develop BatonTTS, a TTS model trained specifically
for this task. Our experiments demonstrate that BatonVoice achieves strong
performance in controllable and emotional speech synthesis, outperforming
strong open- and closed-source baselines. Notably, our approach enables
remarkable zero-shot cross-lingual generalization, accurately applying feature
control abilities to languages unseen during post-training. This demonstrates
that objectifying speech into textual vocal features can more effectively
unlock the linguistic intelligence of LLMs.
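The conductor/orchestra decoupling described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`conductor_plan`, `orchestra_synthesize`) and the feature vocabulary are hypothetical stand-ins for the LLM conductor and the BatonTTS orchestra, shown only to make the two-stage pipeline concrete.

```python
# Hypothetical sketch of the BatonVoice two-stage pipeline.
# Stage 1 (conductor): an LLM turns a user instruction into a textual
# "plan" of explicit vocal features. Stage 2 (orchestra): a TTS model
# generates speech conditioned on that plan, never seeing the instruction.
# Both stages are stubbed here with simple rules / a string placeholder.

def conductor_plan(instruction: str) -> dict:
    """Stand-in for the LLM 'conductor': map a free-form instruction
    to explicit vocal features such as pitch and energy."""
    plan = {"pitch": "neutral", "energy": "medium"}
    if "excited" in instruction:
        plan.update(pitch="high", energy="high")
    elif "calm" in instruction:
        plan.update(pitch="low", energy="low")
    return plan

def orchestra_synthesize(text: str, plan: dict) -> str:
    """Stand-in for the TTS 'orchestra': condition generation on the
    textual feature plan rather than on the raw instruction."""
    return f"<speech text={text!r} pitch={plan['pitch']} energy={plan['energy']}>"

plan = conductor_plan("Say it in an excited voice")
audio = orchestra_synthesize("Hello there", plan)
print(audio)
```

Because the orchestra only consumes the textual feature plan, the instruction-understanding stage can be swapped or applied across languages without retraining the synthesis stage, which is the intuition behind the reported zero-shot cross-lingual generalization.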