MemTrain: Self-Supervised Context Memory Training

June 2, 2026
Authors: Ziheng Li, Xingrun Xing, Haoqing Wang, Zhi-Hong Deng, Yehui Tang
cs.AI

Abstract

Memory is an indispensable capability for long-horizon LLM agents, enabling them to preserve and use information accumulated across extended interactions. Existing memory-agent approaches are typically trained end-to-end with reinforcement learning on downstream tasks. However, collecting high-quality annotated problems for memory-intensive scenarios is costly, and the resulting training data often lack the diversity needed to cover general memory behaviors. In this work, we propose MemTrain, a self-supervised training framework that broadly strengthens the context-memory capability of LLM agents so that downstream post-training becomes more effective. MemTrain introduces two coupled proxy tasks over unlabeled Wikipedia corpora: (1) an end-to-end masked reconstruction objective, which requires the model to recover masked entities after multiple rounds of memory updates, thereby encouraging memory maintenance from the perspective of the final outcome; and (2) an intermediate memory recall objective, which requires the model to reconstruct masked historical information from intermediate memory states, encouraging faithful compression and memory completeness throughout the interaction. The two objectives are jointly optimized with Group Relative Policy Optimization (GRPO). Extensive experiments on long-text QA and search-based QA benchmarks show that MemTrain consistently improves downstream memory-intensive reasoning performance across different models, with gains of up to 17.67 points over direct task-specific post-training.
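
The abstract gives no implementation details, so the following is only a minimal sketch of how a masked-reconstruction reward over unlabeled text might feed group-relative advantages of the kind GRPO uses. The helper names (`mask_entities`, `reconstruction_reward`, `group_relative_advantages`), the `[MASK_i]` placeholder format, the exact-match scoring rule, and the choice to mask every occurrence of an entity are all assumptions made for illustration, not the authors' method. The memory-update loop and the policy model are omitted.

```python
import re
import statistics


def mask_entities(chunks, entities):
    """Replace each entity with a numbered placeholder in every chunk.

    Returns the masked chunks and an answer key (placeholder -> entity).
    Hypothetical helper: the paper's masking scheme is not described here.
    """
    masked = list(chunks)
    answers = {}
    for i, entity in enumerate(entities):
        tag = f"[MASK_{i}]"
        pattern = re.compile(re.escape(entity))
        found = False
        updated = []
        for chunk in masked:
            new_chunk, count = pattern.subn(tag, chunk)
            found = found or count > 0
            updated.append(new_chunk)
        masked = updated
        if found:  # only keep entities that actually occur in the text
            answers[tag] = entity
    return masked, answers


def reconstruction_reward(predictions, answers):
    """Fraction of masked entities recovered (case-insensitive exact match).

    Assumed scoring rule; the source does not state the reward definition.
    """
    if not answers:
        return 0.0
    correct = sum(
        predictions.get(tag, "").strip().lower() == entity.lower()
        for tag, entity in answers.items()
    )
    return correct / len(answers)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    chunks = [
        "Marie Curie was born in Warsaw.",
        "She won the Nobel Prize in 1903.",
    ]
    masked, answers = mask_entities(chunks, ["Warsaw", "Nobel Prize"])
    # Two hypothetical rollouts answering the same masked sample.
    rollouts = [
        {"[MASK_0]": "Warsaw", "[MASK_1]": "Nobel Prize"},
        {"[MASK_0]": "Paris", "[MASK_1]": "Nobel Prize"},
    ]
    rewards = [reconstruction_reward(p, answers) for p in rollouts]
    print(masked)
    print(rewards, group_relative_advantages(rewards))
```

Because the answer key comes from the text itself, no annotated questions are needed, which is the property the abstract emphasizes. In this sketch a rollout is credited only relative to the other rollouts in its group, so a group with identical rewards yields zero advantage for every member.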