MemTrain: Self-Supervised Context Memory Training

Abstract

Memory is an indispensable capability for long-horizon LLM agents, enabling them to preserve and use information accumulated across extended interactions. Existing memory-agent approaches are typically trained end-to-end with reinforcement learning on downstream tasks. However, collecting high-quality annotated problems for memory-intensive scenarios is costly, and the resulting training data often lack the diversity needed to cover general memory behaviors. In this work, we propose MemTrain, a self-supervised training framework that broadly strengthens the context-memory capability of LLM agents so that downstream post-training becomes more effective. MemTrain introduces two coupled proxy tasks over unlabeled Wikipedia corpora: (1) an end-to-end masked reconstruction objective, which requires the model to recover masked entities after multiple rounds of memory updates, thereby encouraging memory maintenance from the perspective of the final outcome; and (2) an intermediate memory recall objective, which requires the model to reconstruct masked historical information from intermediate memory states, thereby encouraging faithful compression and memory completeness throughout the interaction. The two objectives are jointly optimized with GRPO. Extensive experiments on long-text QA and search-based QA benchmarks show that MemTrain consistently improves downstream memory-intensive reasoning performance across different models, achieving gains of up to 17.67 points over direct task-specific post-training.
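
The two proxy tasks and their joint optimization can be illustrated with a minimal sketch. The abstract does not specify how entities are masked, how reconstruction is scored, or how the two objectives are weighted, so everything below is an assumption made for illustration: the helper names (`mask_entities`, `reconstruction_reward`, `combined_reward`, `group_relative_advantages`), the exact-match scoring, and the `weight` parameter are all hypothetical. The last function shows only the group-relative reward standardization commonly associated with GRPO, not the full policy update.

```python
import statistics


def mask_entities(text, entities):
    """Replace each entity with a numbered placeholder (assumed masking scheme).

    Returns the masked text and a mapping from placeholder to original entity.
    """
    answers = {}
    masked = text
    for i, entity in enumerate(entities):
        token = f"[MASK_{i}]"
        masked = masked.replace(entity, token)
        answers[token] = entity
    return masked, answers


def reconstruction_reward(predictions, answers):
    """Fraction of masked entities recovered by case-insensitive exact match.

    Exact match is an assumption; the paper may use a different score.
    """
    if not answers:
        return 0.0
    hits = sum(
        predictions.get(token, "").strip().lower() == entity.lower()
        for token, entity in answers.items()
    )
    return hits / len(answers)


def combined_reward(final_reward, recall_rewards, weight=0.5):
    """Mix the end-to-end reward with the mean intermediate recall reward.

    `weight` is a hypothetical mixing coefficient, not a value from the paper.
    """
    recall = sum(recall_rewards) / len(recall_rewards) if recall_rewards else 0.0
    return (1 - weight) * final_reward + weight * recall


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

In this sketch, a rollout that recovers more masked entities, both at the end and from its intermediate memory states, receives a higher combined reward, and therefore a positive advantage relative to the other rollouts in its group.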