Metacognitive Memory Policy Optimization for Long-Horizon LLM Agents

Abstract

Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into compact memory. However, existing approaches typically train these memory policies with outcome-based reinforcement learning, which cannot localize where intermediate memory quality degrades. As interactions unfold, ambiguous recursive summaries progressively discard task-relevant information and introduce semantic noise. This exacerbates belief deviation, obscuring the agent's estimate of the latent task state and ultimately derailing long-horizon reasoning. We therefore argue that memory optimization should focus not merely on trajectory-level success, but also on the clarity of the belief induced by intermediate summaries. To this end, we introduce Belief Entropy, a self-supervised proxy that probes how uncertain the model remains about the latent task state given its current memory. Based on this proxy, we propose Metacognitive Memory Policy Optimization (MMPO). Instead of relying only on sparse outcome-based signals, MMPO provides fine-grained, memory-specific supervision by explicitly penalizing summaries that induce high epistemic uncertainty. Experiments show that MMPO consistently outperforms existing methods on diverse long-horizon tasks, maintaining 97.1% performance even when scaled to 1.75M-token contexts.
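
The abstract does not give the exact form of Belief Entropy or of the MMPO penalty, so the following is only an illustrative sketch of the general idea. It assumes the belief can be represented as a discrete probability distribution over candidate latent task states (for example, obtained by probing the model given its current memory), and it uses plain Shannon entropy as the uncertainty measure. The function names `belief_entropy` and `shaped_reward`, and the `penalty_weight` parameter, are hypothetical and not taken from the paper.

```python
import math


def belief_entropy(probs):
    """Shannon entropy (in nats) of a discrete belief over candidate task states.

    Assumption: `probs` holds non-negative weights, one per candidate state;
    they are renormalized here so they need not sum exactly to 1.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("belief must carry positive probability mass")
    normalized = [p / total for p in probs]
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in normalized if p > 0)


def shaped_reward(outcome_reward, probs, penalty_weight=0.1):
    """Outcome reward minus an entropy penalty on the summary that induced `probs`.

    Assumption: a simple linear penalty; the actual MMPO objective may differ.
    """
    return outcome_reward - penalty_weight * belief_entropy(probs)


# Two hypothetical beliefs induced by two different memory summaries.
sharp_belief = [0.97, 0.01, 0.01, 0.01]  # summary keeps the task state clear
vague_belief = [0.25, 0.25, 0.25, 0.25]  # summary leaves the state ambiguous

if __name__ == "__main__":
    print("sharp:", belief_entropy(sharp_belief), shaped_reward(1.0, sharp_belief))
    print("vague:", belief_entropy(vague_belief), shaped_reward(1.0, vague_belief))
```

Under these assumptions, two trajectories with the same final outcome receive different rewards when one of them passed through a summary that left the task state ambiguous, which is the kind of memory-specific signal that a purely outcome-based reward cannot provide.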