Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

Abstract

Large language models (LLMs) are increasingly deployed on long-horizon tasks in partially observable environments, where they must act while inferring and tracking a complex environment state over many steps. This raises two challenges: partial observability requires maintaining uncertainty over unobserved world attributes, and long interaction histories cause the context to grow without bound, diluting task-relevant information. A principled solution to both challenges is a belief state: a posterior distribution over environment states given past observations and actions, which compactly encodes the history for decision making regardless of episode length. In LLM agents, however, the open-ended nature of text makes it unclear how to represent such a distribution. We therefore introduce Agent-BRACE (Agent Belief state Representation via Abstraction and Confidence Estimation), a method that decouples an LLM agent into a belief state model and a policy model, jointly optimized via reinforcement learning. The belief state model produces a structured approximation of the belief distribution: a set of atomic natural language claims about the environment, each annotated with an ordinal verbalized certainty label ranging from "certain" to "unknown". The policy model conditions on this compact, structured approximate belief rather than on the full history, learning to select actions under explicit uncertainty. Across long-horizon, partially observable embodied language environments, Agent-BRACE achieves an average absolute improvement of +14.5% (Qwen2.5-3B-Instruct) and +5.3% (Qwen3-4B-Instruct), outperforming strong RL baselines while maintaining a near-constant context window independent of episode length. Further analysis shows that the learned belief becomes increasingly well calibrated over the course of an episode as evidence accumulates.
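To make the structured belief concrete, the following is a minimal Python sketch of how a set of atomic claims with ordinal certainty labels might be represented and handed to a policy. It is an illustration under stated assumptions, not the paper's implementation: the abstract names only the endpoints "certain" and "unknown", so the intermediate labels, the `Claim` type, the prompt layout, and the example claims are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Certainty(IntEnum):
    """Ordinal certainty labels. Only the two endpoints come from the abstract."""
    UNKNOWN = 0
    UNLIKELY = 1  # assumed intermediate label (hypothetical)
    LIKELY = 2    # assumed intermediate label (hypothetical)
    CERTAIN = 3


@dataclass(frozen=True)
class Claim:
    """One atomic natural language claim about the environment (hypothetical type)."""
    text: str
    certainty: Certainty


def render_belief(claims):
    """Serialize the claims, most certain first, one per line."""
    ordered = sorted(claims, key=lambda c: c.certainty, reverse=True)
    return "\n".join(f"[{c.certainty.name.lower()}] {c.text}" for c in ordered)


def policy_prompt(task, claims):
    """Build the policy input from the task and the current belief only.

    No interaction history is included, so the prompt size tracks the
    number of claims rather than the episode length.
    """
    return f"Task: {task}\nBelief:\n{render_belief(claims)}\nNext action:"


# Hypothetical example claims for an embodied household task.
belief = [
    Claim("The mug is in the kitchen", Certainty.LIKELY),
    Claim("The agent is holding nothing", Certainty.CERTAIN),
    Claim("The drawer contains a key", Certainty.UNKNOWN),
]
prompt = policy_prompt("Put the mug on the desk", belief)
print(prompt)
```

In this sketch the belief is overwritten at each step instead of being appended to, which is one way the context could stay near-constant; how the belief state model actually updates its claims is not specified in the abstract.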