TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

Abstract

Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing work neglects three real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose TIME, a multi-level benchmark designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs covering 3 levels with 11 fine-grained sub-tasks. The benchmark comprises 3 sub-datasets, each reflecting a different real-world challenge: TIME-Wiki, TIME-News, and TIME-Dial. We conduct extensive experiments on reasoning and non-reasoning models, analyze temporal reasoning performance in depth across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset, to foster future research and standardized evaluation in temporal reasoning. The code is available at https://github.com/sylvain-wei/TIME, and the dataset is available at https://huggingface.co/datasets/SylvainWei/TIME.
