TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

Abstract

Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing work neglects three real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose TIME, a multi-level benchmark designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs covering 3 levels with 11 fine-grained sub-tasks. The benchmark comprises 3 sub-datasets reflecting different real-world challenges: TIME-Wiki, TIME-News, and TIME-Dial. We conduct extensive experiments on reasoning and non-reasoning models, analyze temporal reasoning performance in depth across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset, to foster future research and standardized evaluation in temporal reasoning. The code is available at https://github.com/sylvain-wei/TIME and the dataset at https://huggingface.co/datasets/SylvainWei/TIME.
