When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

Abstract

Long-horizon interactions require language models to manage accumulating information: when to update their state, when to preserve it, and what to ignore. We study this challenge as Contextual Belief Management (CBM): maintaining a predicted belief state aligned with formal evidence while isolating task-irrelevant noise. To make CBM measurable, we introduce BeliefTrack, a closed-world benchmark spanning Rule Discovery and Circuit Diagnosis, where a finite belief space and symbolic verifiers enable exact turn-level evaluation. BeliefTrack diagnoses three failure modes: Failed Stay, Failed Update, and Failed Isolation. Across multiple LLMs, vanilla models exhibit severe CBM failures, and explicit belief-tracking prompts provide only limited gains. In contrast, reinforcement learning with belief-state rewards reduces failure rates by 70.9\% on average. Further probing reveals the latent belief-state dynamics behind these failures, and representation-level steering reduces failure rates by 46.1\% across the two tasks\footnote{Code is coming soon at https://github.com/zjunlp/CBM.}.