BatonVoice: 大規模言語モデルからの言語的知能を活用した制御可能な音声合成のための操作主義的フレームワーク

要旨

大規模言語モデル（LLMs）の台頭は、マルチモーダルモデルを再構築しており、音声合成はその代表的な応用分野の一つである。しかし、既存のアプローチでは、これらのモデルの言語的知能を十分に活用しておらず、特に強力な指示追従能力を活かしきれていないことが多い。この制約により、制御可能なテキスト読み上げ（TTS）のためのテキスト指示に従う能力が妨げられている。この問題を解決するため、我々は「操作主義」に着想を得た新しいパラダイムを提案する。このパラダイムでは、指示の理解と音声生成を分離する。我々はBatonVoiceというフレームワークを導入し、LLMが「指揮者」としてユーザーの指示を理解し、明示的な音声特徴（例：ピッチ、エネルギー）を含むテキスト「計画」を生成する。別個のTTSモデルである「オーケストラ」が、これらの特徴から音声を生成する。このコンポーネントを実現するため、我々はBatonTTSを開発した。これはこのタスクに特化して訓練されたTTSモデルである。実験の結果、BatonVoiceは制御可能かつ感情豊かな音声合成において優れた性能を発揮し、強力なオープンソースおよびクローズドソースのベースラインを上回ることが示された。特に、我々のアプローチは、ポストトレーニング中に見られなかった言語に対しても特徴制御能力を正確に適用するという、顕著なゼロショットのクロスリンガル汎化を可能にした。これは、音声をテキスト的な音声特徴として客観化することが、LLMsの言語的知能をより効果的に引き出すことを示している。

English

The rise of Large Language Models (LLMs) is reshaping multimodel models, with speech synthesis being a prominent application. However, existing approaches often underutilize the linguistic intelligence of these models, typically failing to leverage their powerful instruction-following capabilities. This limitation hinders the model's ability to follow text instructions for controllable Text-to-Speech~(TTS). To address this, we propose a new paradigm inspired by ``operationalism'' that decouples instruction understanding from speech generation. We introduce BatonVoice, a framework where an LLM acts as a ``conductor'', understanding user instructions and generating a textual ``plan'' -- explicit vocal features (e.g., pitch, energy). A separate TTS model, the ``orchestra'', then generates the speech from these features. To realize this component, we develop BatonTTS, a TTS model trained specifically for this task. Our experiments demonstrate that BatonVoice achieves strong performance in controllable and emotional speech synthesis, outperforming strong open- and closed-source baselines. Notably, our approach enables remarkable zero-shot cross-lingual generalization, accurately applying feature control abilities to languages unseen during post-training. This demonstrates that objectifying speech into textual vocal features can more effectively unlock the linguistic intelligence of LLMs.

BatonVoice: 大規模言語モデルからの言語的知能を活用した制御可能な音声合成のための操作主義的フレームワーク

BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs

要旨

Support