Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-to-Speech Systems

Abstract

Instruction-guided text-to-speech (ITTS) enables users to control speech generation through natural language prompts, offering a more intuitive interface than traditional TTS. However, the alignment between user style instructions and listener perception remains largely unexplored. This work first presents a perceptual analysis of ITTS controllability across two expressive dimensions (adverbs of degree and graded emotion intensity) and collects human ratings on speaker age and word-level emphasis attributes. To comprehensively reveal the instruction-perception gap, we provide the Expressive VOice Control (E-VOC) corpus, a data collection with large-scale human evaluations. Furthermore, we find that (1) gpt-4o-mini-tts is the most reliable ITTS model, showing strong alignment between instructions and generated utterances across acoustic dimensions; (2) the five analyzed ITTS systems tend to generate adult voices even when the instructions request child or elderly voices; and (3) fine-grained control remains a major challenge, indicating that most ITTS systems have substantial room for improvement in interpreting attribute instructions that differ only slightly.