Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
May 12, 2026
Authors: Joykirat Singh, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen, Akshay Nambi, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
cs.AI
Abstract
Large language models (LLMs) are increasingly deployed on long-horizon tasks in partially observable environments, where they must act while inferring and tracking a complex environment state over many steps. This leads to two challenges: partial observability requires maintaining uncertainty over unobserved world attributes, and long interaction history causes context to grow without bound, diluting task-relevant information. A principled solution to both challenges is a belief state: a posterior distribution over environment states given past observations and actions, which compactly encodes history for decision making regardless of episode length. In LLM agents, however, the open-ended nature of text makes it unclear how to represent such a distribution. Therefore, we introduce Agent-BRACE: Agent Belief state Representation via Abstraction and Confidence Estimation, a method that decouples an LLM agent into a belief state model and a policy model, jointly optimized via reinforcement learning. The belief state model produces a structured approximation of the belief distribution: a set of atomic natural language claims about the environment, each annotated with an ordinal verbalized certainty label ranging from certain to unknown. The policy model conditions on this compact, structured approximate belief rather than the full history, learning to select actions under explicit uncertainty. Across long-horizon, partially observable embodied language environments, Agent-BRACE achieves an average absolute improvement of +14.5% (Qwen2.5-3B-Instruct) and +5.3% (Qwen3-4B-Instruct), outperforming strong RL baselines while maintaining a near-constant context window independent of episode length. Further analysis shows that the learned belief becomes increasingly calibrated over the course of an episode as evidence accumulates.
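To make the structured belief representation concrete, here is a minimal sketch of what a set of atomic claims with ordinal verbalized certainty labels might look like as a data structure, and how it could be serialized into a compact prompt for the policy model. The class names, the exact four-level label set, and the rendering format are illustrative assumptions; the paper specifies only that labels range from "certain" to "unknown".

```python
from dataclasses import dataclass

# Ordinal certainty scale, most to least certain. The exact label set is
# an assumption; the paper states only that labels range from
# "certain" to "unknown".
CERTAINTY_LEVELS = ["certain", "likely", "uncertain", "unknown"]

@dataclass
class BeliefClaim:
    """One atomic natural-language claim about the environment,
    annotated with a verbalized certainty label."""
    claim: str
    certainty: str

    def __post_init__(self):
        if self.certainty not in CERTAINTY_LEVELS:
            raise ValueError(f"unknown certainty label: {self.certainty}")

def render_belief(claims: list[BeliefClaim]) -> str:
    """Serialize the structured belief into a compact, near-constant-size
    prompt for the policy model, ordered from most to least certain."""
    ordered = sorted(claims, key=lambda c: CERTAINTY_LEVELS.index(c.certainty))
    return "\n".join(f"[{c.certainty}] {c.claim}" for c in ordered)

# Hypothetical belief state partway through an embodied episode.
belief = [
    BeliefClaim("the key is in the kitchen drawer", "likely"),
    BeliefClaim("the front door is locked", "certain"),
    BeliefClaim("the basement has been searched", "unknown"),
]
print(render_belief(belief))
```

The key design point this sketch reflects is that the policy conditions on this fixed-format summary rather than the full interaction history, so the policy's context stays roughly constant regardless of episode length.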