When Should a Model Change Its Mind? Contextual Belief Management in Large Language Models

Abstract

Long-horizon interactions require language models to manage accumulating information: when to update their state, when to preserve it, and what to ignore. We frame this challenge as Contextual Belief Management (CBM): maintaining a predicted belief state that stays aligned with formal evidence while isolating task-irrelevant noise. To make CBM measurable, we introduce BeliefTrack, a closed-world benchmark spanning Rule Discovery and Circuit Diagnosis, in which a finite belief space and symbolic verifiers enable exact turn-level evaluation. BeliefTrack diagnoses three failure modes: Failed Stay, Failed Update, and Failed Isolation. Across multiple LLMs, vanilla models exhibit severe CBM failures, and explicit belief-tracking prompts provide only limited gains. In contrast, reinforcement learning with belief-state rewards reduces failure rates by 70.9\% on average. Further probing reveals the latent belief-state dynamics behind these failures, and representation-level steering reduces failure rates by 46.1\% on average across the two tasks\footnote{Code will be released at https://github.com/zjunlp/CBM.}.