Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

April 23, 2026
Authors: Srija Anand, Ashwin Sankar, Ishvinder Sethi, Aaditya Pareek, Kartik Rajput, Gaurav Yadav, Nikhil Narasimhan, Adish Pandya, Deepon Halder, Mohammed Safi Ur Rahman Khan, Praveen S V, Shobhit Banga, Mitesh M Khapra
cs.AI

Abstract

Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to text-to-speech (TTS) introduces high variance due to linguistic diversity and the multidimensional nature of speech perception. We present a controlled, multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from more than 1,900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preferences using SHAP analysis, and analyze leaderboard reliability alongside model strengths and trade-offs across perceptual dimensions.
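
To make the leaderboard step concrete, here is a minimal sketch of how Bradley-Terry strengths can be estimated from pairwise preference votes. The abstract does not say how the authors fit the model, so the iterative minorize-maximize (MM) update below is one standard choice and not necessarily theirs. The function name `fit_bradley_terry`, the system labels `tts_a`, `tts_b`, `tts_c`, and the vote counts are all hypothetical and exist only for illustration.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iter=200, tol=1e-9):
    """Estimate Bradley-Terry strengths with the classic MM update.

    comparisons: iterable of (winner, loser) pairs, one per vote.
    Returns a dict mapping each system to a strength; strengths sum to 1.
    Under the model, P(i beats j) = strength[i] / (strength[i] + strength[j]).
    """
    wins = defaultdict(float)         # total wins per system
    pair_counts = defaultdict(float)  # votes per unordered pair of systems
    systems = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        systems.update((winner, loser))

    # Start from uniform strengths.
    strengths = {s: 1.0 / len(systems) for s in systems}
    for _ in range(n_iter):
        updated = {}
        for i in systems:
            denom = 0.0
            for j in systems:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strengths[i] + strengths[j])
            # MM update: wins divided by the expected-comparison term.
            updated[i] = wins[i] / denom if denom else strengths[i]
        total = sum(updated.values())
        updated = {s: v / total for s, v in updated.items()}
        delta = max(abs(updated[s] - strengths[s]) for s in systems)
        strengths = updated
        if delta < tol:
            break
    return strengths


# Hypothetical votes: (winner, loser), repeated once per rater judgment.
votes = (
    [("tts_a", "tts_b")] * 7 + [("tts_b", "tts_a")] * 3
    + [("tts_b", "tts_c")] * 6 + [("tts_c", "tts_b")] * 4
    + [("tts_a", "tts_c")] * 8 + [("tts_c", "tts_a")] * 2
)
strengths = fit_bradley_terry(votes)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

Sorting systems by fitted strength gives the leaderboard order. A study at the scale described here would likely also need per-language fits, tie handling, and confidence intervals (for example via bootstrapping over votes), none of which this sketch attempts.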