When Agents Commit Too Early: Diagnosing Premature Commitment in LLM Agents

Abstract

Long-horizon LLM agents can fail quietly: they settle on one reading of the evidence early, then spend the rest of the run defending it. We call this premature commitment. Final-answer scoring misses this failure mode because it sees only the answer, not whether the process has already collapsed to a stable path. We define representational commitment as cross-run hidden-state convergence at a fixed reasoning step, and use it as an early diagnostic of trajectory consistency. On Llama-3.1-70B running ReAct on HotpotQA, step-4 hidden-state similarity predicts downstream behavioral consistency (r = -0.35, partial r = -0.45), with a signature localized in both time and layer. The signal replicates on Qwen-2.5-72B and Phi-3-14B, and on StrategyQA (r = -0.83). It does not track correctness: committed-wrong and committed-correct questions are not separable by activation similarity. That boundary is central to the claim. Commitment tells us whether an agent has settled, not whether it is right. A runtime monitor detects inconsistent trajectories from hidden states at AUROC up to 0.97 (0.85 to 0.88 under a stricter split), and a prompting intervention cuts behavioral variance by 28% against a token-matched control while leaving accuracy statistically unchanged. We also test whether the signal can route self-consistency compute; on a harder benchmark it helps only modestly and is matched by a simpler output-based baseline. The result is a diagnostic for a hidden process failure with clear limits, not a general accuracy lever.
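As a rough illustration of what cross-run hidden-state convergence at a fixed reasoning step could look like in code, the sketch below scores one question by the mean pairwise cosine similarity of hidden states collected from several runs at the same step. The abstract does not state the similarity measure, the layer, or the pooling, so cosine similarity, the function name `commitment_score`, and the one-vector-per-run input layout are all assumptions made for illustration, not the paper's implementation.

```python
import numpy as np


def commitment_score(hidden_states: np.ndarray) -> float:
    """Mean pairwise cosine similarity across runs (hypothetical helper).

    hidden_states has shape (n_runs, d_model): one vector per run, all taken
    at the same reasoning step and layer. How each vector is obtained
    (layer choice, token pooling) is left open here and is an assumption.
    """
    h = np.asarray(hidden_states, dtype=np.float64)
    if h.ndim != 2 or h.shape[0] < 2:
        raise ValueError("expected an array of shape (n_runs >= 2, d_model)")

    # Normalize each run's vector to unit length, guarding against zero norm.
    norms = np.linalg.norm(h, axis=1, keepdims=True)
    unit = h / np.clip(norms, 1e-12, None)

    # Cosine similarity between every pair of runs.
    sim = unit @ unit.T

    # Average over distinct pairs only (strict upper triangle).
    rows, cols = np.triu_indices(h.shape[0], k=1)
    return float(sim[rows, cols].mean())
```

A score near 1 would mean the runs have converged to nearly the same representation at that step; lower scores would mean the runs still differ. Consistent with the boundary drawn in the abstract, such a score says nothing about whether the converged runs are correct.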