"Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems"

Abstract

Instruction-guided text-to-speech (ITTS) lets users control speech generation through natural language prompts, offering a more intuitive interface than traditional TTS. However, the alignment between users' style instructions and listeners' perception remains largely unexplored. This work first presents a perceptual analysis of ITTS controllability across two expressive dimensions (adverbs of degree and graded emotion intensity), and additionally collects human ratings on speaker age and word-level emphasis attributes. To reveal the instruction-perception gap comprehensively, we provide the Expressive VOice Control (E-VOC) corpus, a data collection with large-scale human evaluations. Our analysis yields three findings: (1) gpt-4o-mini-tts is the most reliable ITTS model, showing the strongest alignment between instructions and generated utterances across acoustic dimensions; (2) the five analyzed ITTS systems tend to generate adult voices even when the instructions ask for child or elderly voices; and (3) fine-grained control remains a major challenge, indicating that most ITTS systems have substantial room for improvement in interpreting slightly different attribute instructions.