Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems

September 17, 2025
Authors: Yi-Cheng Lin, Huang-Cheng Chou, Tzu-Chieh Wei, Kuan-Yu Chen, Hung-yi Lee
cs.AI

Abstract

Instruction-guided text-to-speech (ITTS) lets users control speech generation through natural language prompts, offering a more intuitive interface than traditional TTS. However, the alignment between user style instructions and listener perception remains largely unexplored. This work first presents a perceptual analysis of ITTS controllability across two expressive dimensions (adverbs of degree and graded emotion intensity) and collects human ratings on speaker age and word-level emphasis attributes. To comprehensively reveal the instruction-perception gap, we provide a dataset with large-scale human evaluations, named the Expressive VOice Control (E-VOC) corpus. Furthermore, we reveal that (1) gpt-4o-mini-tts is the most reliable ITTS model, with strong alignment between instructions and generated utterances across acoustic dimensions; (2) the five analyzed ITTS systems tend to generate adult voices even when the instructions ask for child or elderly voices; and (3) fine-grained control remains a major challenge, indicating that most ITTS systems have substantial room for improvement in interpreting slightly different attribute instructions.