On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation
January 9, 2026
Authors: Jeff Chan-Jan Sju, Liang-Hsuan Tseng, Yi-Cheng Lin, Yen-Chun Kuo, Ju-Chieh Chou, Kai-Wei Chang, Hung-yi Lee, Carlos Busso
cs.AI
Abstract
Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes such as speaker identity and emotion, serving as foundation models for spoken dialogue. In prior literature, these models are often evaluated with "global token perplexity", which directly applies the text perplexity formulation to speech tokens. However, this practice overlooks fundamental differences between the speech and text modalities, possibly leading to an underestimation of speech characteristics. In this work, we propose a variety of likelihood-based and generation-based evaluation methods that serve in place of naive global token perplexity. We demonstrate that the proposed evaluations more faithfully reflect perceived generation quality, as evidenced by stronger correlations with human-rated mean opinion scores (MOS). When assessed under the new metrics, the relative performance landscape of spoken language models is reshaped, revealing a significantly reduced gap between the best-performing model and the human topline. Together, these results suggest that appropriate evaluation is critical for accurately assessing progress in spoken language modeling.
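For reference, the global token perplexity discussed above is, in its standard form, ordinary autoregressive perplexity computed directly over the speech-token sequence. A minimal sketch of that textbook formulation (the notation below is assumed for illustration, not drawn from the paper):

\[
\mathrm{PPL}(x_{1:T}) \;=\; \exp\!\left( -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\!\left( x_t \mid x_{<t} \right) \right)
\]

where $x_1, \dots, x_T$ are the discrete speech tokens of an utterance and $p_\theta$ is the spoken language model's next-token distribution.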