On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation
January 9, 2026
Authors: Jeff Chan-Jan Sju, Liang-Hsuan Tseng, Yi-Cheng Lin, Yen-Chun Kuo, Ju-Chieh Chou, Kai-Wei Chang, Hung-yi Lee, Carlos Busso
cs.AI
Abstract
Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes like speaker and emotion, serving as foundation models for spoken dialogue. In prior literature, these models are often evaluated using "global token perplexity", which directly applies the text perplexity formulation to speech tokens. However, this practice overlooks fundamental differences between speech and text modalities, possibly leading to an underestimation of speech characteristics. In this work, we propose a variety of likelihood- and generation-based evaluation methods that serve in place of naive global token perplexity. We demonstrate that the proposed evaluations more faithfully reflect perceived generation quality, as evidenced by stronger correlations with human-rated mean opinion scores (MOS). When assessed under the new metrics, the relative performance landscape of spoken language models is reshaped, revealing a significantly reduced gap between the best-performing model and the human topline. Together, these results suggest that appropriate evaluation is critical for accurately assessing progress in spoken language modeling.
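For reference, a minimal statement of the formulation the abstract critiques (the notation here is mine, not taken from the paper): for a model $p_\theta$ and a speech-token sequence $x_1, \dots, x_N$, global token perplexity pools every token position equally,

$$\mathrm{PPL}(x_{1:N}) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})\right),$$

i.e. the standard text perplexity formula applied unchanged to speech tokens.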