Game-Time: Evaluating Temporal Dynamics in Spoken Language Models
September 30, 2025
Authors: Kai-Wei Chang, En-Pei Hu, Chun-Yi Kuan, Wenze Ren, Wei-Chih Chen, Guan-Ting Lin, Yu Tsao, Shao-Hua Sun, Hung-yi Lee, James Glass
cs.AI
Abstract
Conversational Spoken Language Models (SLMs) are emerging as a promising paradigm for real-time speech interaction. However, their capacity for temporal dynamics, including the ability to manage timing, tempo, and simultaneous speaking, remains a critical and unevaluated challenge for conversational fluency. To address this gap, we introduce the Game-Time Benchmark, a framework for systematically assessing these temporal capabilities. Inspired by how humans learn a language through language activities, Game-Time consists of basic instruction-following tasks and advanced tasks with temporal constraints, such as tempo adherence and synchronized responses. Our evaluation of diverse SLM architectures reveals a clear performance disparity: while state-of-the-art models handle basic tasks well, many contemporary systems still struggle with fundamental instruction following. More critically, nearly all models degrade substantially under temporal constraints, exposing persistent weaknesses in time awareness and full-duplex interaction. The Game-Time Benchmark provides a foundation for guiding future research toward more temporally aware conversational AI. Demos and datasets are available on our project website: https://ga642381.github.io/Game-Time.