PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech
April 28, 2026
Author: Venkata Pushpak Teja Menta
cs.AI
Abstract
Standard text-to-speech (TTS) evaluation measures intelligibility (WER, CER) and overall naturalness (MOS, UTMOS) but does not quantify accent. A synthesiser may score well on all four metrics yet sound non-native on features that are phonemic in the target language. For Indic languages, these features include retroflex articulation, aspiration, vowel length, and the Tamil retroflex approximant (the letter zha). We present PSP, the Phoneme Substitution Profile, an interpretable, per-phonological-dimension accent benchmark for Indic TTS. PSP decomposes accent into six complementary dimensions: retroflex collapse rate (RR), aspiration fidelity (AF), vowel-length fidelity (LF), Tamil-zha fidelity (ZF), Fréchet Audio Distance (FAD), and prosodic signature divergence (PSD). The first four are measured via forced alignment plus native-speaker-centroid acoustic probes over Wav2Vec2-XLS-R layer-9 embeddings; the latter two are corpus-level distributional distances. In this v1 release we benchmark four commercial and open-source systems (ElevenLabs v3, Cartesia Sonic-3, Sarvam Bulbul, Indic Parler-TTS) on Hindi, Telugu, and Tamil pilot sets, with a fifth system (Praxy Voice) included on all three languages, plus an R5->R6 case study on Telugu. We report three findings: (i) retroflex collapse grows monotonically with phonological difficulty, Hindi < Telugu < Tamil (roughly 1%, 40%, and 68%, respectively); (ii) PSP ordering diverges from WER ordering, in that the commercial WER leaders do not uniformly lead on retroflex or prosodic fidelity; (iii) no single system is Pareto-optimal across all six dimensions. We release native reference centroids (500 clips per language), 1000-clip embeddings for FAD, 500-clip prosodic feature matrices for PSD, 300-utterance golden sets per language, scoring code under the MIT license, and centroids under CC-BY. Formal MOS-correlation analysis is deferred to v2; v1 reports five internal-consistency signals plus a native-audio sanity check.
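To make the idea of a centroid-based probe concrete, the following is a minimal sketch of how a retroflex collapse rate could be computed once each expected-retroflex segment has been force-aligned and embedded. It is an illustration under stated assumptions, not the released scoring code: the abstract does not specify the distance metric or the contrast class, so cosine distance and a dental centroid are assumed here, the function names are hypothetical, and toy 3-dimensional vectors stand in for Wav2Vec2-XLS-R layer-9 embeddings.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def retroflex_collapse_rate(segments, retroflex_centroid, dental_centroid):
    """Fraction of expected-retroflex segments that sit closer to the
    dental centroid than to the retroflex centroid.

    Assumption: a segment counts as "collapsed" when its nearest native
    centroid is the dental one. The paper's actual criterion may differ.
    """
    if not segments:
        raise ValueError("need at least one segment embedding")
    collapsed = sum(
        1
        for emb in segments
        if cosine_distance(emb, dental_centroid)
        < cosine_distance(emb, retroflex_centroid)
    )
    return collapsed / len(segments)


# Toy native-speaker centroids (hypothetical values, 3-d for readability).
retroflex_centroid = [1.0, 0.0, 0.0]
dental_centroid = [0.0, 1.0, 0.0]

# Toy embeddings of synthesised segments where a retroflex was expected.
segments = [
    [0.9, 0.1, 0.0],  # nearest to the retroflex centroid
    [0.2, 0.8, 0.1],  # nearest to the dental centroid: collapsed
    [0.7, 0.3, 0.2],  # nearest to the retroflex centroid
    [0.1, 0.9, 0.0],  # nearest to the dental centroid: collapsed
]

rate = retroflex_collapse_rate(segments, retroflex_centroid, dental_centroid)
print(f"RR = {rate:.2f}")  # prints "RR = 0.50"
```

With real data, the centroids would presumably be averaged from native-speaker segments and the per-segment embeddings pooled over the aligned frames; both steps are omitted here because the abstract does not describe them.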