When Agents Commit Too Early: Diagnosing Premature Commitment in LLM Agents

Abstract

Long-horizon LLM agents can fail quietly: they settle on one reading of the evidence early, then spend the rest of the run defending it. We call this premature commitment. Final-answer scoring misses the failure mode because it sees only the answer, not whether the process has already collapsed to a stable path. We define representational commitment as cross-run hidden-state convergence at a fixed reasoning step, and use it as an early diagnostic of trajectory consistency. On Llama-3.1-70B running ReAct on HotpotQA, step-4 hidden-state similarity predicts downstream behavioral consistency (r = -0.35, partial r = -0.45), with a localized temporal and layer-wise signature. The signal replicates across Qwen-2.5-72B and Phi-3-14B, and on StrategyQA (r = -0.83). It does not track correctness: committed-wrong and committed-correct questions cannot be separated by activation similarity. That boundary is central to the claim. Commitment tells us whether an agent has settled, not whether it is right. A runtime monitor detects inconsistent trajectories from hidden states at AUROC up to 0.97 (0.85–0.88 under a stricter split), and a prompting intervention cuts behavioral variance by 28% against a token-matched control while leaving accuracy statistically unchanged. We also test whether the signal can route self-consistency compute; on a harder benchmark it helps only modestly and is matched by a simpler output-based baseline. The result is a diagnostic for a hidden process failure, with clear limits, rather than a general accuracy lever.
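To make the definition of representational commitment concrete, the following is a minimal sketch of one way to score cross-run hidden-state convergence at a fixed reasoning step. It assumes the score is the mean pairwise cosine similarity between one hidden-state vector per run; the abstract does not specify the similarity metric, the layer, or the pooling, so the metric and the names `cosine` and `commitment_score` are illustrative assumptions rather than the authors' implementation.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def commitment_score(run_states: Sequence[Sequence[float]]) -> float:
    """Mean pairwise cosine similarity across runs (assumed metric).

    run_states holds one hidden-state vector per run, all taken at the
    same reasoning step (and, hypothetically, the same layer) for the
    same question. Higher values mean the runs have converged more.
    """
    if len(run_states) < 2:
        raise ValueError("need at least two runs to measure convergence")
    sims = [cosine(u, v) for u, v in combinations(run_states, 2)]
    return sum(sims) / len(sims)


# Toy usage: three runs whose states point the same way count as fully
# converged; mutually orthogonal states show no convergence.
converged = commitment_score([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
divergent = commitment_score([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
```

Under this sketch, a per-question score of this kind would be computed at step 4 and then correlated with a downstream measure of behavioral consistency. Consistent with the abstract's stated boundary, a high score would indicate only that the runs have settled, not that the answer they settled on is correct.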