Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

Abstract

Large language models (LLMs) are increasingly deployed on long-horizon tasks in partially observable environments, where they must act while inferring and tracking a complex environment state over many steps. This raises two challenges: partial observability requires maintaining uncertainty over unobserved world attributes, and long interaction histories cause the context to grow without bound, diluting task-relevant information. A principled solution to both challenges is a belief state: a posterior distribution over environment states given past observations and actions, which compactly encodes the history for decision making regardless of episode length. In LLM agents, however, the open-ended nature of text makes it unclear how to represent such a distribution. We therefore introduce Agent-BRACE (Agent Belief state Representation via Abstraction and Confidence Estimation), a method that decouples an LLM agent into a belief state model and a policy model, jointly optimized via reinforcement learning. The belief state model produces a structured approximation of the belief distribution: a set of atomic natural language claims about the environment, each annotated with an ordinal verbalized certainty label ranging from "certain" to "unknown". The policy model conditions on this compact, structured approximate belief rather than on the full history, and learns to select actions under explicit uncertainty. Across long-horizon, partially observable embodied language environments, Agent-BRACE achieves an average absolute improvement of +14.5% (Qwen2.5-3B-Instruct) and +5.3% (Qwen3-4B-Instruct), outperforming strong RL baselines while maintaining a near-constant context window independent of episode length. Further analysis shows that the learned belief becomes increasingly calibrated over the course of an episode as evidence accumulates.
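To make the decoupling concrete, the following is a minimal Python sketch of the data flow described above: a belief made of atomic claims with ordinal certainty labels, and a policy that is prompted with the rendered belief only, never with the full history. It is an illustration under stated assumptions, not the authors' implementation. The intermediate labels (`LIKELY`, `UNLIKELY`), the rendering format, and all function names (`render_belief`, `run_episode`, `toy_update`, `toy_policy`, `toy_env`) are hypothetical; the toy functions stand in for the LLM-based belief state model, the policy model, and the environment, and reinforcement learning is omitted entirely.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, List, Optional, Tuple


class Certainty(IntEnum):
    """Ordinal certainty labels. Only the endpoints ("certain", "unknown")
    come from the abstract; the intermediate levels are assumed."""
    UNKNOWN = 0
    UNLIKELY = 1
    LIKELY = 2
    CERTAIN = 3


@dataclass(frozen=True)
class Claim:
    """One atomic natural language claim about the environment."""
    text: str
    certainty: Certainty


Belief = List[Claim]


def render_belief(belief: Belief) -> str:
    """Serialize the belief into the text the policy is prompted with."""
    return "\n".join(f"[{c.certainty.name.lower()}] {c.text}" for c in belief)


def run_episode(
    update_belief: Callable[[Belief, Optional[str], str], Belief],
    select_action: Callable[[str], str],
    step_env: Callable[[str], Tuple[str, bool]],
    first_observation: str,
    max_steps: int = 10,
) -> List[Tuple[str, str]]:
    """Run the belief/policy loop and return (policy_prompt, action) pairs."""
    belief: Belief = []
    observation = first_observation
    last_action: Optional[str] = None
    trace: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        # The belief model sees only the previous belief, the last action,
        # and the newest observation, so the history is never replayed.
        belief = update_belief(belief, last_action, observation)
        prompt = render_belief(belief)
        # The policy conditions on the rendered belief alone.
        last_action = select_action(prompt)
        trace.append((prompt, last_action))
        observation, done = step_env(last_action)
        if done:
            break
    return trace


# Toy stand-ins (hypothetical) for the two models and the environment.
def toy_update(belief: Belief, last_action: Optional[str], observation: str) -> Belief:
    claims = {c.text: c.certainty for c in belief}
    claims.setdefault("the key is in the drawer", Certainty.UNKNOWN)
    if "drawer" in observation:
        claims["the key is in the drawer"] = Certainty.CERTAIN
    return [Claim(text, level) for text, level in claims.items()]


def toy_policy(prompt: str) -> str:
    # Explore while the claim is unknown, exploit once it is certain.
    return "open drawer" if "[unknown]" in prompt else "take key"


def toy_env(action: str) -> Tuple[str, bool]:
    if action == "open drawer":
        return "you see a key in the drawer", False
    return "you hold the key", True


if __name__ == "__main__":
    for prompt, action in run_episode(toy_update, toy_policy, toy_env, "you are in a kitchen"):
        print(f"{prompt!r} -> {action}")
```

In this sketch the policy prompt is a function of the current claim set alone, which is why its size tracks the number of claims rather than the number of steps taken so far.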
