TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

May 19, 2025
Authors: Shaohang Wei, Wei Li, Feifan Song, Wen Luo, Tianyi Zhuang, Haochen Tan, Zhijiang Guo, Houfeng Wang
cs.AI

Abstract

Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing work neglects three real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose TIME, a multi-level benchmark designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs covering 3 levels with 11 fine-grained sub-tasks. The benchmark encompasses 3 sub-datasets that reflect different real-world challenges: TIME-Wiki, TIME-News, and TIME-Dial. We conduct extensive experiments on reasoning and non-reasoning models, analyze in depth their temporal reasoning performance across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset, to foster future research and standardized evaluation in temporal reasoning. The code is available at https://github.com/sylvain-wei/TIME, and the dataset is available at https://huggingface.co/datasets/SylvainWei/TIME.
