PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech

Abstract

Standard text-to-speech (TTS) evaluation measures intelligibility (word error rate, WER; character error rate, CER) and overall naturalness (mean opinion score, MOS; UTMOS) but does not quantify accent. A synthesiser can score well on all four metrics yet sound non-native on features that are phonemic in the target language. For Indic languages, these features include retroflex articulation, aspiration, vowel length, and the Tamil retroflex approximant (the letter zha). We present PSP, the Phoneme Substitution Profile, an interpretable accent benchmark for Indic TTS that scores each phonological dimension separately. PSP decomposes accent into six complementary dimensions: retroflex collapse rate (RR), aspiration fidelity (AF), vowel-length fidelity (LF), Tamil-zha fidelity (ZF), Fréchet Audio Distance (FAD), and prosodic signature divergence (PSD). The first four are measured by forced alignment followed by native-speaker-centroid acoustic probes over Wav2Vec2-XLS-R layer-9 embeddings; the last two are corpus-level distributional distances. In this v1 release we benchmark four commercial and open-source systems (ElevenLabs v3, Cartesia Sonic-3, Sarvam Bulbul, Indic Parler-TTS) on Hindi, Telugu, and Tamil pilot sets, include a fifth system (Praxy Voice) on all three languages, and add an R5->R6 case study on Telugu. We report three findings: (i) retroflex collapse grows monotonically with phonological difficulty, Hindi < Telugu < Tamil (~1%, ~40%, ~68%); (ii) PSP ordering diverges from WER ordering: the commercial systems that lead on WER do not uniformly lead on retroflex or prosodic fidelity; (iii) no single system is Pareto-optimal across all six dimensions. We release native reference centroids (500 clips per language), 1000-clip embeddings for FAD, 500-clip prosodic feature matrices for PSD, 300-utterance golden sets per language, scoring code under the MIT license, and centroids under CC-BY. Formal MOS correlation is deferred to v2; v1 reports five internal-consistency signals plus a native-audio sanity check.
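To make the idea of a native-speaker-centroid probe concrete, the following is a minimal sketch of how a retroflex collapse rate could be computed from segment embeddings. It is not the released scoring code. The abstract does not specify the distance measure, the contrast class, or the decision rule, so cosine similarity, a dental contrast centroid, and the "closer to the contrast centroid counts as collapsed" rule are all assumptions. The function names and the 3-dimensional toy vectors are hypothetical stand-ins for force-aligned Wav2Vec2-XLS-R layer-9 embeddings.

```python
import math


def centroid(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors (assumed metric, not from the paper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def collapse_rate(segments, target_centroid, contrast_centroid):
    """Fraction of segments that sit closer to the contrast centroid than to
    the target centroid (assumed decision rule)."""
    collapsed = sum(
        1
        for seg in segments
        if cosine(seg, contrast_centroid) > cosine(seg, target_centroid)
    )
    return collapsed / len(segments)


# Made-up toy vectors; real inputs would be embeddings of aligned phone segments.
native_retroflex = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]]
native_dental = [[0.1, 1.0, 0.0], [0.2, 0.9, 0.1]]
tts_segments = [
    [0.8, 0.3, 0.0],  # near the retroflex centroid
    [0.2, 0.8, 0.1],  # near the dental centroid
    [0.9, 0.1, 0.1],  # near the retroflex centroid
    [0.1, 0.9, 0.0],  # near the dental centroid
]

retroflex_c = centroid(native_retroflex)
dental_c = centroid(native_dental)
rate = collapse_rate(tts_segments, retroflex_c, dental_c)
print(f"retroflex collapse rate: {rate:.2f}")
```

In this toy input two of the four synthesised segments fall nearer the dental centroid, so the sketch reports a collapse rate of 0.50. The same pattern would extend to the aspiration, vowel-length, and zha dimensions by swapping the target and contrast centroids, again under the stated assumptions.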