
Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents

December 23, 2025
Authors: Yiming Du, Baojun Wang, Yifan Xiang, Zhaowei Wang, Wenyu Huang, Boyang Xue, Bin Liang, Xingshan Zeng, Fei Mi, Haoli Bai, Lifeng Shang, Jeff Z. Pan, Yuxin Jiang, Kam-Fai Wong
cs.AI

Abstract

Temporal reasoning over long, multi-session dialogues is a critical capability for conversational agents. However, existing work and our pilot study show that as dialogue histories grow in length and accumulate noise, current long-context models struggle to accurately identify temporally pertinent information, significantly impairing reasoning performance. To address this, we introduce Memory-T1, a framework that learns a time-aware memory selection policy using reinforcement learning (RL). It employs a coarse-to-fine strategy, first pruning the dialogue history into a candidate set using temporal and relevance filters, then applying an RL agent that selects the precise evidence sessions. The RL training is guided by a multi-level reward function optimizing (i) answer accuracy, (ii) evidence grounding, and (iii) temporal consistency. In particular, the temporal consistency reward provides a dense signal by evaluating alignment with the query's time scope at both the session level (chronological proximity) and the utterance level (chronological fidelity), enabling the agent to resolve subtle chronological ambiguities. On the Time-Dialog benchmark, Memory-T1 boosts a 7B model to an overall score of 67.0%, establishing a new state of the art for open-source models and outperforming a 14B baseline by 10.2%. Ablation studies show that the temporal consistency and evidence grounding rewards jointly contribute a 15.0% performance gain. Moreover, Memory-T1 remains robust up to 128k tokens, where baseline models collapse, demonstrating its effectiveness against noise in extensive dialogue histories. The code and datasets are publicly available at https://github.com/Elvin-Yiming-Du/Memory-T1/
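The abstract describes the RL training signal only at a high level, so the Python sketch below is a hypothetical illustration of how the three reward levels (answer accuracy, evidence grounding, temporal consistency) might be combined into a single scalar. The function names, the weights w_acc/w_evid/w_temp, and the session-fraction formulation of temporal consistency are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch only: the abstract does not specify the reward formula,
# so the weights, names, and aggregation below are assumptions.

def temporal_consistency_reward(selected_sessions, query_scope):
    """Toy stand-in for the dense temporal reward: the fraction of selected
    sessions whose timestamp falls inside the query's time scope
    (session-level proximity); utterance-level fidelity is omitted here."""
    if not selected_sessions:
        return 0.0
    in_scope = sum(query_scope[0] <= s["time"] <= query_scope[1]
                   for s in selected_sessions)
    return in_scope / len(selected_sessions)


def multi_level_reward(answer_correct, evidence_hit_rate,
                       selected_sessions, query_scope,
                       w_acc=1.0, w_evid=0.5, w_temp=0.5):
    """Weighted sum of (i) answer accuracy, (ii) evidence grounding, and
    (iii) temporal consistency, mirroring the three reward levels named in
    the abstract; the weights are hypothetical."""
    r_acc = 1.0 if answer_correct else 0.0
    r_temp = temporal_consistency_reward(selected_sessions, query_scope)
    return w_acc * r_acc + w_evid * evidence_hit_rate + w_temp * r_temp


if __name__ == "__main__":
    sessions = [{"id": 3, "time": 5}, {"id": 7, "time": 12}]
    # One of the two selected sessions falls inside the query scope (4, 10),
    # so the temporal term contributes 0.5 * w_temp to the total reward.
    print(multi_level_reward(answer_correct=True, evidence_hit_rate=0.5,
                             selected_sessions=sessions, query_scope=(4, 10)))
```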