Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

Abstract

Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to text-to-speech (TTS) introduces high variance because of linguistic diversity and the multidimensional nature of speech perception. We present a controlled, multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using more than 5,000 native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120,000 pairwise comparisons from more than 1,900 native raters. In addition to overall preference, raters provide judgments along 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preference with SHAP analysis, and analyze leaderboard reliability alongside model strengths and trade-offs across the perceptual dimensions.
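As a rough illustration of how pairwise votes can be turned into a leaderboard, the sketch below fits Bradley-Terry strengths with the standard minorization-maximization (MM) update. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names `fit_bradley_terry` and `win_probability`, the system labels, and the toy votes are all hypothetical, and ties, per-language fits, and confidence intervals are not handled.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iter=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic MM update: strength_i = wins_i / sum_j n_ij / (s_i + s_j),
    then rescales so the strengths average to 1 (the scale is arbitrary).
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    systems = set()
    for winner, loser in comparisons:
        if winner == loser:
            continue  # a system cannot be compared with itself
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        systems.update((winner, loser))

    strengths = {s: 1.0 for s in systems}
    for _ in range(n_iter):
        updated = {}
        for s in systems:
            denom = 0.0
            for pair, n in pair_counts.items():
                if s in pair:
                    (other,) = pair - {s}
                    denom += n / (strengths[s] + strengths[other])
            updated[s] = wins[s] / denom if denom > 0 else strengths[s]
        total = sum(updated.values())
        updated = {s: v * len(systems) / total for s, v in updated.items()}
        delta = max(abs(updated[s] - strengths[s]) for s in systems)
        strengths = updated
        if delta < tol:
            break
    return strengths


def win_probability(strengths, a, b):
    """Model probability that system a is preferred over system b."""
    return strengths[a] / (strengths[a] + strengths[b])


# Toy votes (hypothetical, not data from the paper): each tuple is (winner, loser).
toy_votes = (
    [("system_a", "system_b")] * 3 + [("system_b", "system_a")] * 1
    + [("system_b", "system_c")] * 3 + [("system_c", "system_b")] * 1
    + [("system_a", "system_c")] * 3 + [("system_c", "system_a")] * 1
)

strengths = fit_bradley_terry(toy_votes)
leaderboard = sorted(strengths, key=strengths.get, reverse=True)
print(leaderboard)
```

In this toy setting every pair is compared equally often, so the fitted order follows the raw win counts. With unbalanced sampling across systems, the Bradley-Terry fit and simple win rates can diverge, which is one reason a model-based ranking is commonly used for leaderboards of this kind.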
