SAM: State-Adaptive Memory for Long-Horizon Reasoning Agents

Abstract

Long-horizon agentic reasoning requires large language models to act over long interaction histories containing thoughts, tool calls, observations, and partial conclusions. The challenge is not merely that these histories grow long, but that the information needed for the current decision may be scattered across distant steps and may become relevant only later. Existing approaches address this difficulty by truncating the interaction history, compressing it into shorter surrogates, or retrieving selected parts of it for reuse, but they do not explicitly model how access to past interaction should adapt to the agent's evolving state. We instead cast long-horizon reasoning as a problem of state-adaptive memory. To this end, we propose State-Adaptive Memory (SAM), a standalone framework that consolidates ongoing interaction into compact memory cues while preserving raw trajectory pages for intent-driven recall. These cues are not treated as replacements for history; rather, they serve as lightweight handles that allow the agent to reconstruct temporally distant information according to its current needs, without retraining the underlying backbone. We further optimize the memory module through expert-guided supervision and reinforcement learning, aligning it with trajectory-level utility. Across BrowseComp, BrowseComp-ZH, WideSearch, and HLE, SAM consistently outperforms strong baselines across diverse agent backbones. Our results suggest that explicit memory modeling provides a simple and effective foundation for long-horizon agentic reasoning.
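
To make the cue-and-page idea concrete, the following is a minimal sketch and not the SAM implementation. It assumes that raw trajectory steps are stored verbatim as pages, that each page has a short text cue, and that recall ranks cues by word overlap with the agent's current intent. The names `Page`, `CueMemory`, `consolidate`, `context`, and `recall` are hypothetical, and the overlap score is only a stand-in for the learned memory module the abstract describes.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Raw trajectory steps, kept verbatim so they can be recalled later."""
    page_id: int
    steps: list[str]


@dataclass
class CueMemory:
    """Hypothetical cue-plus-page store (an illustration, not the paper's code)."""
    pages: dict[int, Page] = field(default_factory=dict)
    cues: dict[int, str] = field(default_factory=dict)

    def consolidate(self, steps: list[str], cue: str) -> int:
        """Store raw steps as a new page and register a compact cue for it."""
        page_id = len(self.pages)
        self.pages[page_id] = Page(page_id, list(steps))
        self.cues[page_id] = cue
        return page_id

    def context(self) -> str:
        """Compact view the agent would carry forward: cues only, no raw pages."""
        return "\n".join(
            f"[page {pid}] {cue}" for pid, cue in sorted(self.cues.items())
        )

    def recall(self, intent: str, top_k: int = 1) -> list[Page]:
        """Return raw pages whose cues share the most words with the intent.

        Assumption: word overlap replaces whatever learned recall policy
        a real system would use.
        """
        query = set(intent.lower().split())
        scored = [
            (len(query & set(cue.lower().split())), pid)
            for pid, cue in self.cues.items()
        ]
        # Keep only cues with some overlap; break ties by the older page.
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
        return [self.pages[pid] for _, pid in ranked[:top_k]]


memory = CueMemory()
memory.consolidate(
    ["search: author of paper X", "observation: Jane Doe, 2019"],
    cue="author and year of paper X",
)
memory.consolidate(
    ["search: venue of paper Y", "observation: workshop track"],
    cue="venue of paper Y",
)
hits = memory.recall("which year was paper X published")
```

In this sketch the agent's working context holds only the cue lines returned by `context()`, while `recall` restores the full raw page when the current intent calls for it. This mirrors the abstract's point that cues act as handles to history rather than replacements for it.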