Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

Abstract

Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to text-to-speech (TTS) introduces high variance because of linguistic diversity and the multidimensional nature of speech perception. We present a controlled, multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using more than 5,000 native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120,000 pairwise comparisons from more than 1,900 native raters. In addition to an overall preference, raters provide judgments along 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preferences with SHAP analysis, and analyze leaderboard reliability alongside model strengths and trade-offs across the perceptual dimensions.
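The abstract does not spell out how the Bradley-Terry model is fitted. As a minimal sketch, assuming the standard minorization-maximization (MM) update and a simple list of (winner, loser) outcomes, a leaderboard could be derived from pairwise comparisons roughly as follows. The function name `fit_bradley_terry`, the system labels, and the toy counts are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iter=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Illustrative MM iteration only; assumes winner != loser in every pair.
    Strengths are rescaled after each step so that their mean is 1.
    """
    wins = defaultdict(float)         # total wins per system
    pair_counts = defaultdict(float)  # number of comparisons per unordered pair
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        items.update((winner, loser))
    items = sorted(items)

    strengths = {i: 1.0 for i in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            # Sum over opponents j of n_ij / (p_i + p_j)
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strengths[i] + strengths[j])
            updated[i] = wins[i] / denom if denom > 0 else strengths[i]
        # Rescale: the model is only identified up to a constant factor
        total = sum(updated.values())
        updated = {i: v * len(items) / total for i, v in updated.items()}
        delta = max(abs(updated[i] - strengths[i]) for i in items)
        strengths = updated
        if delta < tol:
            break
    return strengths


# Toy data (hypothetical): three systems with made-up outcomes.
toy = (
    [("sys_A", "sys_B")] * 3 + [("sys_B", "sys_A")] * 1
    + [("sys_B", "sys_C")] * 3 + [("sys_C", "sys_B")] * 1
    + [("sys_A", "sys_C")] * 2
)
scores = fit_bradley_terry(toy)
leaderboard = sorted(scores, key=scores.get, reverse=True)
```

Under the Bradley-Terry model, the probability that system i is preferred over system j is p_i / (p_i + p_j), so sorting by the fitted strengths yields a ranking. A real analysis of this kind would typically also report uncertainty, for example bootstrap confidence intervals over the comparisons.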
