Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

Abstract

Conversational Spoken Language Models (SLMs) are emerging as a promising paradigm for real-time speech interaction. However, their capacity for temporal dynamics, including the ability to manage timing, tempo, and simultaneous speaking, remains a critical and under-evaluated challenge for conversational fluency. To address this gap, we introduce the Game-Time Benchmark, a framework for systematically assessing these temporal capabilities. Inspired by how humans learn a language through language activities, Game-Time consists of basic instruction-following tasks and advanced tasks with temporal constraints, such as tempo adherence and synchronized responses. Our evaluation of diverse SLM architectures reveals a clear performance disparity: while state-of-the-art models handle basic tasks well, many contemporary systems still struggle with fundamental instruction following. More critically, nearly all models degrade substantially under temporal constraints, exposing persistent weaknesses in time awareness and full-duplex interaction. The Game-Time Benchmark provides a foundation for guiding future research toward more temporally aware conversational AI. Demos and datasets are available on our project website: https://ga642381.github.io/Game-Time.