MemTrain: Self-Supervised Context Memory Training

June 2, 2026
Authors: Ziheng Li, Xingrun Xing, Haoqing Wang, Zhi-Hong Deng, Yehui Tang
cs.AI

Abstract

Memory is an indispensable capability for long-horizon LLM agents, enabling them to preserve and use information accumulated across extended interactions. Existing memory-agent approaches are typically trained end-to-end with reinforcement learning on downstream tasks. However, collecting high-quality annotated problems for memory-intensive scenarios is costly, and the resulting training data often lack the diversity needed to cover general memory behaviors. In this work, we propose MemTrain, a self-supervised training framework that broadly strengthens the context-memory capability of LLM agents so that downstream post-training becomes more effective. MemTrain introduces two coupled proxy tasks over unlabeled Wikipedia corpora: (1) an end-to-end masked reconstruction objective, which requires the model to recover masked entities after multiple rounds of memory updates, thereby encouraging memory maintenance from the perspective of the final outcome; and (2) an intermediate memory recall objective, which requires the model to reconstruct masked historical information from intermediate memory states, thereby encouraging faithful compression and memory completeness throughout the interaction. The two objectives are jointly optimized using GRPO. Extensive experiments on long-text QA and search-based QA benchmarks show that MemTrain consistently improves downstream memory-intensive reasoning performance across different models, with gains of up to 17.67 points over direct task-specific post-training.
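To make the two proxy tasks more concrete, the following is a minimal Python sketch under stated assumptions, not the authors' implementation. The abstract does not specify how entities are selected, how reconstructions are scored, or how the two objectives are weighted, so `mask_entities`, `reconstruction_reward`, `combined_reward`, and the `weight` parameter are all hypothetical. Only the within-group standardization of rewards in `group_relative_advantages` follows the general GRPO recipe.

```python
import re
import statistics


def mask_entities(text, entities, mask_token="[MASK]"):
    """Mask every occurrence of the given entities.

    Hypothetical helper: the abstract does not say how entities are chosen.
    Returns the masked text and the masked answers in order of appearance.
    """
    if not entities:
        return text, []
    # Try longer entities first so overlapping names are matched greedily.
    ordered = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(e) for e in ordered))
    answers = []

    def _replace(match):
        answers.append(match.group(0))
        return mask_token

    return pattern.sub(_replace, text), answers


def reconstruction_reward(predictions, answers):
    """Assumed scoring rule: fraction of masks recovered by exact match."""
    if not answers:
        return 0.0
    hits = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return hits / len(answers)


def combined_reward(final_score, intermediate_scores, weight=0.5):
    """Assumed mixing rule for the two objectives (weight is hypothetical).

    final_score: reward for recovering masks after all memory updates.
    intermediate_scores: rewards for recall from each intermediate memory.
    """
    if intermediate_scores:
        mean_inter = sum(intermediate_scores) / len(intermediate_scores)
    else:
        mean_inter = 0.0
    return (1 - weight) * final_score + weight * mean_inter


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

In this sketch, each rollout for a masked document would receive one scalar from `combined_reward`, and the rewards of all rollouts sampled for the same document would be passed together to `group_relative_advantages` before the policy update.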