BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs

Abstract

The rise of Large Language Models (LLMs) is reshaping multimodal models, with speech synthesis being a prominent application. However, existing approaches often underutilize the linguistic intelligence of these models and typically fail to leverage their powerful instruction-following capabilities. This limitation hinders a model's ability to follow text instructions for controllable Text-to-Speech (TTS). To address this, we propose a new paradigm inspired by "operationalism" that decouples instruction understanding from speech generation. We introduce BatonVoice, a framework in which an LLM acts as a "conductor": it interprets user instructions and generates a textual "plan" consisting of explicit vocal features (e.g., pitch, energy). A separate TTS model, the "orchestra", then generates speech from these features. To realize this component, we develop BatonTTS, a TTS model trained specifically for this task. Our experiments demonstrate that BatonVoice achieves strong performance in controllable and emotional speech synthesis, outperforming strong open- and closed-source baselines. Notably, our approach enables zero-shot cross-lingual generalization, accurately applying feature control to languages unseen during post-training. This demonstrates that objectifying speech into textual vocal features can more effectively unlock the linguistic intelligence of LLMs.
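
The decoupling described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the BatonVoice or BatonTTS implementation: the `VocalPlan` fields, their units, the JSON serialization, and the rule-based `conductor` stub (which stands in for an LLM) are all hypothetical. The only point it shows is the interface: the "conductor" emits a plan as plain text, and the "orchestra" consumes nothing but that text.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class VocalPlan:
    """Hypothetical textual plan: explicit vocal features for one utterance."""
    text: str
    pitch_hz: float  # assumed unit; the abstract only names "pitch"
    energy: float    # assumed relative loudness in [0, 1]


def conductor(instruction: str, text: str) -> str:
    """Stand-in for the LLM: turns an instruction into a plan serialized as text.

    A real system would prompt an LLM here; this rule-based stub only
    illustrates the interface between the two components.
    """
    excited = "excited" in instruction.lower()
    plan = VocalPlan(
        text=text,
        pitch_hz=220.0 if excited else 150.0,
        energy=0.9 if excited else 0.5,
    )
    return json.dumps(asdict(plan))


def orchestra(plan_text: str) -> dict:
    """Stand-in for the TTS model: sees only the textual plan, not the instruction."""
    plan = VocalPlan(**json.loads(plan_text))
    # A real TTS model would return audio; here we return the parsed conditioning.
    return {"text": plan.text, "conditioning": (plan.pitch_hz, plan.energy)}


plan_text = conductor("Say it in an excited voice", "We won the match!")
result = orchestra(plan_text)
```

Because the plan is ordinary text, the instruction-following step and the synthesis step can be developed and swapped independently, which is the property the abstract attributes to the framework.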