TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

Abstract

Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing work often overlooks three real-world challenges for temporal reasoning: (1) dense temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose TIME, a multi-level benchmark designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. The benchmark comprises 3 sub-datasets, each reflecting a different real-world challenge: TIME-Wiki, TIME-News, and TIME-Dial. We conduct extensive experiments on both reasoning and non-reasoning models, analyze temporal reasoning performance in depth across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset intended to foster future research and standardized evaluation in temporal reasoning. The code is available at https://github.com/sylvain-wei/TIME and the dataset is available at https://huggingface.co/datasets/SylvainWei/TIME

