Time-R1: Towards Comprehensive Temporal Reasoning in LLMs

May 16, 2025
Authors: Zijia Liu, Peixuan Han, Haofei Yu, Haoru Li, Jiaxuan You
cs.AI

Abstract

Large Language Models (LLMs) demonstrate impressive capabilities but lack robust temporal intelligence, struggling to integrate reasoning about the past with predictions and plausible generations of the future. Existing methods typically target isolated temporal skills, such as question answering about past events or basic forecasting, and exhibit poor generalization, particularly when dealing with events beyond their knowledge cutoff or requiring creative foresight. To address these limitations, we introduce Time-R1, the first framework to endow a moderate-sized (3B-parameter) LLM with comprehensive temporal abilities: understanding, prediction, and creative generation. Our approach features a novel three-stage development path; the first two stages constitute a reinforcement learning (RL) curriculum driven by a meticulously designed dynamic rule-based reward system. This framework progressively builds (1) foundational temporal understanding and logical event-time mappings from historical data and (2) prediction skills for events beyond its knowledge cutoff, and finally (3) achieves remarkable generalization to creative future scenario generation without any fine-tuning. Strikingly, experiments demonstrate that Time-R1 outperforms models over 200 times larger, including the state-of-the-art 671B-parameter DeepSeek-R1, on highly challenging future event prediction and creative scenario generation benchmarks. This work provides strong evidence that thoughtfully engineered, progressive RL fine-tuning allows smaller, efficient models to achieve superior temporal performance, offering a practical and scalable path towards truly time-aware AI. To foster further research, we also release Time-Bench, a large-scale multi-task temporal reasoning dataset derived from 10 years of news data, and our series of Time-R1 checkpoints.
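
The abstract does not spell out the reward rules, so the sketch below is only an illustration of how a "dynamic rule-based" reward for the date-prediction stages might be wired up: the reward decays with the month gap between the predicted and true event dates, and the decay rate tightens as the curriculum progresses. Every name, coefficient, and schedule here (temporal_reward, curriculum_alpha, the 0.05-0.3 range) is a hypothetical stand-in, not the paper's actual design.

    import math
    from datetime import date

    def month_gap(pred: date, target: date) -> int:
        # Whole-month distance between a predicted and a ground-truth date.
        return abs((pred.year - target.year) * 12 + (pred.month - target.month))

    def temporal_reward(pred: date, target: date, alpha: float) -> float:
        # Rule-based accuracy reward: 1.0 for the exact month, decaying
        # exponentially as the prediction drifts from the true date.
        return math.exp(-alpha * month_gap(pred, target))

    def curriculum_alpha(step: int, total_steps: int,
                         start: float = 0.05, end: float = 0.3) -> float:
        # "Dynamic" schedule (an assumption, not taken from the paper):
        # the penalty sharpens over training, so early exploration is
        # punished less harshly than late-stage errors.
        t = min(step / max(total_steps, 1), 1.0)
        return start + t * (end - start)

    # Example: a prediction three months off, early vs. late in training.
    pred, gold = date(2024, 9, 1), date(2024, 6, 1)
    print(round(temporal_reward(pred, gold, curriculum_alpha(0, 10_000)), 2))       # 0.86
    print(round(temporal_reward(pred, gold, curriculum_alpha(10_000, 10_000)), 2))  # 0.41

The printed values show the intended effect: the same three-month error earns roughly 0.86 early in training but only about 0.41 at the end, once the model is expected to be accurate.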
