

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

May 12, 2026
作者: Joykirat Singh, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen, Akshay Nambi, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
cs.AI

Abstract

Large language models (LLMs) are increasingly deployed on long-horizon tasks in partially observable environments, where they must act while inferring and tracking a complex environment state over many steps. This leads to two challenges: partial observability requires maintaining uncertainty over unobserved world attributes, and long interaction history causes context to grow without bound, diluting task-relevant information. A principled solution to both challenges is a belief state: a posterior distribution over environment states given past observations and actions, which compactly encodes history for decision making regardless of episode length. In LLM agents, however, the open-ended nature of text makes it unclear how to represent such a distribution. Therefore, we introduce Agent-BRACE: Agent Belief state Representation via Abstraction and Confidence Estimation, a method that decouples an LLM agent into a belief state model and a policy model, jointly optimized via reinforcement learning. The belief state model produces a structured approximation of the belief distribution: a set of atomic natural language claims about the environment, each annotated with an ordinal verbalized certainty label ranging from certain to unknown. The policy model conditions on this compact, structured approximate belief rather than the full history, learning to select actions under explicit uncertainty. Across long-horizon, partially observable embodied language environments, Agent-BRACE achieves an average absolute improvement of +14.5% (Qwen2.5-3B-Instruct) and +5.3% (Qwen3-4B-Instruct), outperforming strong RL baselines while maintaining a near-constant context window independent of episode length. Further analysis shows that the learned belief becomes increasingly calibrated over the course of an episode as evidence accumulates.
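The structured belief described above, a set of atomic claims each tagged with an ordinal verbalized certainty label, can be sketched as a small data structure. This is a minimal illustration, not the paper's implementation: the exact label set, class names, and prompt rendering below are assumptions; the abstract only specifies that claims are atomic natural-language statements with ordinal labels ranging from "certain" to "unknown".

```python
from dataclasses import dataclass
from typing import List

# Hypothetical ordinal certainty scale; the paper specifies only the
# endpoints ("certain" ... "unknown"), not the intermediate labels.
CERTAINTY_LEVELS = ["certain", "likely", "uncertain", "unknown"]


@dataclass
class BeliefClaim:
    """One atomic natural-language claim about the environment."""
    statement: str   # e.g. "the key is in the kitchen drawer"
    certainty: str   # one of CERTAINTY_LEVELS

    def __post_init__(self) -> None:
        if self.certainty not in CERTAINTY_LEVELS:
            raise ValueError(f"unknown certainty label: {self.certainty}")


@dataclass
class BeliefState:
    """Structured approximation of the belief distribution: a flat set of
    labeled claims whose size does not grow with episode length."""
    claims: List[BeliefClaim]

    def to_prompt(self) -> str:
        # Render the belief compactly so a policy model can condition on
        # it instead of the full interaction history.
        lines = [f"- ({c.certainty}) {c.statement}" for c in self.claims]
        return "Current beliefs about the environment:\n" + "\n".join(lines)


belief = BeliefState(claims=[
    BeliefClaim("the fridge contains an apple", "certain"),
    BeliefClaim("the drawer next to the sink is locked", "likely"),
    BeliefClaim("location of the second key", "unknown"),
])
print(belief.to_prompt())
```

The key design point this mirrors is that the policy's context stays near-constant: the prompt above is bounded by the number of tracked claims, not by the number of past steps.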