
When Agents Commit Too Soon: Diagnosing Premature Commitment in LLM Agents

June 22, 2026
Author: Aman Mehta
cs.AI

Abstract

Long-horizon LLM agents can fail quietly: they settle on one reading of the evidence early, then spend the rest of the run defending it. We call this premature commitment. Final-answer scoring misses the failure mode because it sees only the answer, not whether the process has already collapsed to a stable path. We define representational commitment as cross-run hidden-state convergence at a fixed reasoning step, and use it as an early diagnostic of trajectory consistency. On Llama-3.1-70B running ReAct on HotpotQA, step-4 hidden-state similarity predicts downstream behavioral consistency (r = -0.35, partial r = -0.45), with a localized temporal and layer-wise signature. The signal replicates across Qwen-2.5-72B and Phi-3-14B, and on StrategyQA (r = -0.83). It does not track correctness: committed-wrong and committed-correct questions are not separable in activation similarity. That boundary is central to the claim. Commitment tells us whether an agent has settled, not whether it is right. A runtime monitor detects inconsistent trajectories from hidden states at AUROC up to 0.97 (0.85 to 0.88 under a stricter split), and a prompting intervention cuts behavioral variance by 28% against a token-matched control while leaving accuracy statistically unchanged. We also test whether the signal can route self-consistency compute; on a harder benchmark it helps only modestly and is matched by a simpler output-based baseline. The result is a diagnostic for a hidden process failure, with clear limits rather than a general accuracy lever.
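
The definition of representational commitment above (cross-run hidden-state convergence at a fixed reasoning step) can be illustrated with a minimal sketch. The abstract does not state the similarity metric, the layer, or how token positions are pooled, so everything here is an assumption: the sketch uses mean pairwise cosine similarity over one vector per run, and the names `representational_commitment` and `pearson_r` are hypothetical, not taken from the paper.

```python
import numpy as np


def representational_commitment(hidden_states: np.ndarray) -> float:
    """Mean pairwise cosine similarity across runs (assumed metric).

    hidden_states has shape (n_runs, d_model): one vector per run of the
    same question, all taken at the same reasoning step and layer.
    """
    if hidden_states.ndim != 2 or hidden_states.shape[0] < 2:
        raise ValueError("expected shape (n_runs, d_model) with n_runs >= 2")
    # Normalize each run's vector to unit length, guarding against zeros.
    norms = np.linalg.norm(hidden_states, axis=1, keepdims=True)
    unit = hidden_states / np.clip(norms, 1e-12, None)
    # Cosine similarity between every pair of runs.
    sim = unit @ unit.T
    # Average over distinct pairs only (strict upper triangle).
    rows, cols = np.triu_indices(hidden_states.shape[0], k=1)
    return float(sim[rows, cols].mean())


def pearson_r(x, y) -> float:
    """Pearson correlation between two per-question score lists."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])
```

Under these assumptions, one would compute a commitment score per question from its repeated runs, then correlate those scores with a per-question behavioral measure using `pearson_r`. Which behavioral measure the reported r values refer to is not specified in the abstract, so that pairing is left to the reader.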