TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios
May 19, 2025
Authors: Shaohang Wei, Wei Li, Feifan Song, Wen Luo, Tianyi Zhuang, Haochen Tan, Zhijiang Guo, Houfeng Wang
cs.AI
Abstract
Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend
the real world. However, existing work neglects the real-world challenges of
temporal reasoning: (1) intensive temporal information, (2) fast-changing event
dynamics, and (3) complex temporal dependencies in social interactions. To
bridge this gap, we propose TIME, a multi-level benchmark designed for temporal
reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3
levels with 11 fine-grained sub-tasks. This benchmark encompasses 3
sub-datasets reflecting different real-world challenges: TIME-Wiki, TIME-News,
and TIME-Dial. We conduct extensive experiments on reasoning models and
non-reasoning models, perform an in-depth analysis of temporal
reasoning performance across diverse real-world scenarios and tasks, and
summarize the impact of test-time scaling on temporal reasoning capabilities.
Additionally, we release TIME-Lite, a human-annotated subset to foster future
research and standardized evaluation in temporal reasoning. The code is
available at https://github.com/sylvain-wei/TIME, and the dataset is available
at https://huggingface.co/datasets/SylvainWei/TIME.