

Accurate Failure Prediction in Agents Does Not Imply Effective Failure Prevention

February 3, 2026
Authors: Rakshith Vasudev, Melisa Russak, Dan Bikel, Waseem Alshikh
cs.AI

Abstract

Proactive interventions by LLM critic models are often assumed to improve reliability, yet their effects at deployment time are poorly understood. We show that a binary LLM critic with strong offline accuracy (AUROC 0.94) can nevertheless cause severe performance degradation, inducing a 26 percentage point (pp) collapse on one model while affecting another by near zero pp. This variability demonstrates that LLM critic accuracy alone is insufficient to determine whether intervention is safe. We identify a disruption-recovery tradeoff: interventions may recover failing trajectories but also disrupt trajectories that would have succeeded. Based on this insight, we propose a pre-deployment test that uses a small pilot of 50 tasks to estimate whether intervention is likely to help or harm, without requiring full deployment. Across benchmarks, the test correctly anticipates outcomes: intervention degrades performance on high-success tasks (0 to -26 pp), while yielding a modest improvement on the high-failure ALFWorld benchmark (+2.8 pp, p=0.014). The primary value of our framework is therefore identifying when not to intervene, preventing severe regressions before deployment.
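The pre-deployment pilot test described in the abstract can be illustrated with a small sketch. The abstract does not specify the exact procedure, so everything below is an assumption for illustration: the `run_task(task, critic=...)` harness, the counting of recoveries (failures rescued by intervention) and disruptions (successes broken by intervention), the sign test on discordant pairs, and the `alpha` threshold are hypothetical choices, not the paper's method.

```python
# Minimal sketch of a pre-deployment pilot test in the spirit of the paper.
# Assumptions (not from the paper): the agent harness exposes
# run_task(task, critic=None) -> bool (success), and the help/harm decision
# uses an exact binomial (sign) test on discordant task pairs.
from dataclasses import dataclass
from scipy.stats import binomtest


@dataclass
class PilotResult:
    recoveries: int       # baseline failed, intervened run succeeded
    disruptions: int      # baseline succeeded, intervened run failed
    net_effect_pp: float  # estimated change in success rate, percentage points
    p_value: float        # sign test on discordant pairs


def pilot_test(tasks, run_task, critic, alpha=0.05):
    """Run a small pilot (e.g., 50 tasks) with and without critic
    intervention and recommend whether deploying the critic is safe."""
    recoveries = disruptions = 0
    for task in tasks:
        base_ok = run_task(task, critic=None)    # agent alone
        crit_ok = run_task(task, critic=critic)  # agent + critic intervention
        if not base_ok and crit_ok:
            recoveries += 1
        elif base_ok and not crit_ok:
            disruptions += 1

    discordant = recoveries + disruptions
    net_effect_pp = 100.0 * (recoveries - disruptions) / len(tasks)
    # Two-sided sign test: under "no net effect", a discordant pair is
    # equally likely to be a recovery or a disruption.
    p_value = (binomtest(recoveries, discordant, 0.5).pvalue
               if discordant > 0 else 1.0)

    result = PilotResult(recoveries, disruptions, net_effect_pp, p_value)
    deploy = net_effect_pp > 0 and p_value < alpha
    return result, deploy
```

On a pilot drawn from high-success tasks, disruptions tend to dominate and the sketch recommends against deploying the critic, which mirrors the abstract's main point that the framework's value lies in identifying when not to intervene.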