Game-Time: Evaluating Temporal Dynamics in Spoken Language Models
September 30, 2025
Authors: Kai-Wei Chang, En-Pei Hu, Chun-Yi Kuan, Wenze Ren, Wei-Chih Chen, Guan-Ting Lin, Yu Tsao, Shao-Hua Sun, Hung-yi Lee, James Glass
cs.AI
Abstract
Conversational Spoken Language Models (SLMs) are emerging as a promising
paradigm for real-time speech interaction. However, their capacity to handle
temporal dynamics, including the ability to manage timing, tempo, and
simultaneous speaking, remains a critical and unevaluated challenge for conversational
fluency. To address this gap, we introduce the Game-Time Benchmark, a framework
to systematically assess these temporal capabilities. Inspired by how humans
learn a language through language activities, Game-Time consists of basic
instruction-following tasks and advanced tasks with temporal constraints, such
as tempo adherence and synchronized responses. Our evaluation of diverse SLM
architectures reveals a clear performance disparity: while state-of-the-art
models handle basic tasks well, many contemporary systems still struggle with
fundamental instruction-following. More critically, nearly all models degrade
substantially under temporal constraints, exposing persistent weaknesses in
time awareness and full-duplex interaction. The Game-Time Benchmark provides a
foundation for guiding future research toward more temporally aware
conversational AI. Demos and datasets are available on our project website
https://ga642381.github.io/Game-Time.