When Should a Model Change Its Beliefs? Contextual Belief Management in Large Language Models

Abstract

Long-horizon interactions require language models to manage accumulating information: when to update their state, when to preserve it, and what to ignore. We study this challenge as Contextual Belief Management (CBM): maintaining a predicted belief state that stays aligned with formal evidence while isolating task-irrelevant noise. To make CBM measurable, we introduce BeliefTrack, a closed-world benchmark spanning Rule Discovery and Circuit Diagnosis, where a finite belief space and symbolic verifiers enable exact turn-level evaluation. BeliefTrack diagnoses three failures: Failed Stay, Failed Update, and Failed Isolation. Across multiple LLMs, vanilla models exhibit severe CBM failures, and explicit belief-tracking prompts provide only limited gains. In contrast, reinforcement learning with belief-state rewards reduces failure rates by 70.9\% on average. Further probing reveals latent belief-state dynamics behind these failures, and representation-level steering reduces failure rates by 46.1\% across two tasks.\footnote{Code is coming soon at https://github.com/zjunlp/CBM.}