Metacognitive Memory Policy Optimization for Long-Horizon LLM Agents

Abstract

Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into a compact memory. However, existing approaches typically train these memory policies with outcome-based reinforcement learning, which cannot localize where intermediate memory quality degrades. As interactions unfold, ambiguous recursive summaries progressively discard task-relevant information and introduce semantic noise. This exacerbates belief deviation, obscures the agent's estimate of the latent task state, and ultimately derails long-horizon reasoning. We therefore argue that memory optimization should focus not merely on trajectory-level success but also on the clarity of the belief induced by intermediate summaries. To this end, we introduce Belief Entropy, a self-supervised proxy that probes how uncertain the model remains about the latent task state given its current memory. Building on this proxy, we propose Metacognitive Memory Policy Optimization (MMPO). Rather than relying only on sparse outcome-based signals, MMPO provides fine-grained, memory-specific supervision by explicitly penalizing summaries that induce high epistemic uncertainty. Experiments show that MMPO consistently outperforms existing methods on diverse long-horizon tasks and maintains 97.1% performance even when scaled to 1.75M-token contexts.
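To make the idea of penalizing high-uncertainty summaries concrete, the following is a minimal sketch, not the authors' formulation. It assumes, purely for illustration, that the belief induced by a memory summary can be represented as a probability distribution over a small set of candidate task states, that Belief Entropy can be approximated by the Shannon entropy of that distribution, and that the penalty is subtracted linearly from the outcome reward. The names `belief_entropy`, `shaped_reward`, and `penalty_weight` are hypothetical and do not come from the abstract.

```python
import math
from typing import Sequence


def belief_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a belief over candidate task states.

    Assumption: the belief is given as non-negative weights, which are
    renormalized here so that they sum to one.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("belief weights must sum to a positive value")
    entropy = 0.0
    for p in probs:
        q = p / total  # renormalized probability of this candidate state
        if q > 0:  # zero-probability states contribute nothing
            entropy -= q * math.log(q)
    return entropy


def shaped_reward(
    outcome_reward: float,
    probs: Sequence[float],
    penalty_weight: float = 0.1,  # hypothetical trade-off coefficient
) -> float:
    """Outcome reward minus an entropy penalty for an uncertain belief.

    Assumption: a simple linear penalty; the abstract does not specify
    how the penalty is combined with the outcome signal.
    """
    return outcome_reward - penalty_weight * belief_entropy(probs)


if __name__ == "__main__":
    sharp_belief = [0.97, 0.01, 0.01, 0.01]    # summary leaves little doubt
    diffuse_belief = [0.25, 0.25, 0.25, 0.25]  # summary is ambiguous
    print(shaped_reward(1.0, sharp_belief))
    print(shaped_reward(1.0, diffuse_belief))
```

Under these assumptions, two summaries that lead to the same task outcome receive different rewards: the one that leaves the belief concentrated on a single task state is penalized less than the one that leaves it spread across many states, which is the kind of per-summary signal a sparse outcome reward alone does not provide.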