Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages
April 23, 2026
Authors: Srija Anand, Ashwin Sankar, Ishvinder Sethi, Aaditya Pareek, Kartik Rajput, Gaurav Yadav, Nikhil Narasimhan, Adish Pandya, Deepon Halder, Mohammed Safi Ur Rahman Khan, Praveen S V, Shobhit Banga, Mitesh M Khapra
cs.AI
Abstract
Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to text-to-speech (TTS) introduces high variance due to linguistic diversity and the multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1,900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preferences using SHAP analysis, and analyze leaderboard reliability alongside model strengths and trade-offs across perceptual dimensions.
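
As an illustration of how a Bradley-Terry leaderboard can be derived from pairwise preferences, the sketch below fits strengths with the standard minorization-maximization (MM) update. It is a minimal example under stated assumptions, not the authors' pipeline: the system names "A", "B", "C", the toy win counts, and the function name `fit_bradley_terry` are all hypothetical, and ties, rater effects, and per-language splits are ignored.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iter=200, tol=1e-9):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via MM updates.

    Returns a dict mapping each system to a strength, normalized to sum to 1.
    Assumes every system wins at least once; otherwise the maximum-likelihood
    estimate is degenerate and this simple loop can divide by zero.
    """
    wins = defaultdict(float)          # total wins per system
    pair_counts = defaultdict(float)   # comparisons per unordered pair
    systems = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        systems.update((winner, loser))

    strengths = {s: 1.0 / len(systems) for s in systems}
    for _ in range(n_iter):
        updated = {}
        for i in systems:
            # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = 0.0
            for j in systems:
                if j == i:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strengths[i] + strengths[j])
            updated[i] = wins[i] / denom if denom else strengths[i]
        total = sum(updated.values())
        updated = {s: v / total for s, v in updated.items()}
        delta = max(abs(updated[s] - strengths[s]) for s in systems)
        strengths = updated
        if delta < tol:
            break
    return strengths


# Hypothetical toy data: each pair of systems is compared 10 times.
toy_comparisons = (
    [("A", "B")] * 8 + [("B", "A")] * 2
    + [("B", "C")] * 7 + [("C", "B")] * 3
    + [("A", "C")] * 9 + [("C", "A")] * 1
)
strengths = fit_bradley_terry(toy_comparisons)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

Under this model, the probability that system i is preferred over system j is p_i / (p_i + p_j), so the fitted strengths turn raw pairwise votes into a single ranking even when not every pair is compared equally often.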