Metacognitive Memory Policy Optimization for Long-Horizon LLM Agents

Abstract

Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into compact memory. However, existing approaches typically train these memory policies with outcome-based reinforcement learning, which cannot localize where intermediate memory quality degrades. As interactions unfold, ambiguous recursive summaries progressively discard task-relevant information and introduce semantic noise. This exacerbates belief deviation, obscuring the agent's estimate of the latent task state and ultimately derailing long-horizon reasoning. We therefore argue that memory optimization should focus not merely on trajectory-level success, but also on the clarity of the belief induced by intermediate summaries. To this end, we introduce Belief Entropy, a self-supervised proxy that probes how uncertain the model remains about the latent task state given its current memory. Building on this proxy, we propose Metacognitive Memory Policy Optimization (MMPO). Instead of relying only on sparse outcome-based signals, MMPO provides fine-grained, memory-specific supervision by explicitly penalizing summaries that induce high epistemic uncertainty. Experiments show that MMPO consistently outperforms existing methods on diverse long-horizon tasks, maintaining 97.1% performance even when scaled to 1.75M-token contexts.
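
The abstract does not specify how Belief Entropy is computed or how the penalty enters the training signal. As a purely illustrative sketch, one could assume the model's belief is a discrete distribution over a few candidate task states, measure its Shannon entropy, and subtract a weighted entropy term from the trajectory-level outcome reward for each intermediate summary. The function names `belief_entropy` and `shaped_rewards`, the `penalty_weight` parameter, and the linear penalty form are all assumptions made for this example, not details taken from MMPO.

```python
import math
from typing import List, Sequence


def belief_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete belief over candidate task states.

    Assumption: the belief is a finite list of non-negative weights.
    Weights are normalized here, so they need not sum to exactly 1.
    """
    if any(p < 0 for p in probs):
        raise ValueError("belief weights must be non-negative")
    total = sum(probs)
    if total <= 0:
        raise ValueError("belief weights must sum to a positive value")
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0:  # 0 * log(0) is treated as 0
            entropy -= q * math.log(q)
    return entropy


def shaped_rewards(
    outcome_reward: float,
    step_beliefs: Sequence[Sequence[float]],
    penalty_weight: float = 0.1,  # hypothetical hyperparameter
) -> List[float]:
    """Per-summary reward: the shared outcome reward minus an entropy penalty.

    Hypothetical shaping rule: each intermediate summary is penalized in
    proportion to the entropy of the belief it induces.
    """
    return [
        outcome_reward - penalty_weight * belief_entropy(belief)
        for belief in step_beliefs
    ]


if __name__ == "__main__":
    # One confident summary and one maximally ambiguous summary over two states.
    beliefs = [[1.0, 0.0], [0.5, 0.5]]
    print(shaped_rewards(1.0, beliefs))
```

Under these assumptions, a summary that leaves the belief concentrated on one state keeps the full outcome reward, while a summary that leaves the belief spread evenly across states is penalized the most, which gives each intermediate summary its own learning signal rather than one shared trajectory-level score.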