Accurate Failure Prediction in Agents Does Not Imply Effective Failure Prevention
February 3, 2026
Authors: Rakshith Vasudev, Melisa Russak, Dan Bikel, Waseem Alshikh
cs.AI
Abstract
Proactive interventions by LLM critic models are often assumed to improve reliability, yet their effects at deployment time are poorly understood. We show that a binary LLM critic with strong offline accuracy (AUROC 0.94) can nevertheless cause severe performance degradation, inducing a 26 percentage point (pp) collapse on one model while leaving another nearly unaffected (close to 0 pp). This variability demonstrates that LLM critic accuracy alone is insufficient to determine whether intervention is safe.
We identify a disruption-recovery tradeoff: interventions may recover failing trajectories but also disrupt trajectories that would have succeeded. Based on this insight, we propose a pre-deployment test that uses a small pilot of 50 tasks to estimate whether intervention is likely to help or harm, without requiring full deployment. Across benchmarks, the test correctly anticipates outcomes: intervention degrades performance on high-success tasks (0 to -26 pp), while yielding a modest improvement on the high-failure ALFWorld benchmark (+2.8 pp, p=0.014). The primary value of our framework is therefore identifying when not to intervene, preventing severe regressions before deployment.
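The pilot-based go/no-go decision described above can be sketched in code. This is a minimal illustration, not the paper's actual procedure: the `run_task` interface, the sign-flip permutation test, and all thresholds are assumptions introduced here for concreteness.

```python
import random
from statistics import mean

def pilot_test(run_task, n_pilot=50, seed=0, n_perm=2000):
    """Compare baseline vs. critic-intervened runs on a small pilot.

    run_task(i, intervene) -> bool success (hypothetical interface).
    Returns (effect, p_value, deploy): the estimated success-rate delta,
    a paired-permutation p-value, and a go/no-go recommendation.
    """
    baseline = [run_task(i, intervene=False) for i in range(n_pilot)]
    treated = [run_task(i, intervene=True) for i in range(n_pilot)]
    effect = mean(treated) - mean(baseline)

    # Paired sign-flip permutation test: under the null hypothesis that
    # intervention has no effect, each per-task difference is equally
    # likely to be positive or negative.
    diffs = [int(t) - int(b) for t, b in zip(treated, baseline)]
    observed = sum(diffs)
    rng = random.Random(seed)
    extreme = sum(
        1 for _ in range(n_perm)
        if abs(sum(d if rng.random() < 0.5 else -d for d in diffs))
        >= abs(observed)
    )
    p_value = extreme / n_perm

    # Deploy the critic only if the pilot shows a significant improvement;
    # otherwise the disruption-recovery tradeoff argues for not intervening.
    deploy = effect > 0 and p_value < 0.05
    return effect, p_value, deploy

# Synthetic high-success regime: the critic disrupts some would-be successes
# (90% baseline success vs. 70% with intervention), so the test says "no".
def demo_task(i, intervene):
    return i % 10 >= 3 if intervene else i % 10 != 0

effect, p_value, deploy = pilot_test(demo_task)
print(f"effect={effect:+.2f}, p={p_value:.3f}, deploy={deploy}")
```

In this synthetic high-success scenario the estimated effect is negative, so the recommendation is not to deploy the critic, mirroring the paper's finding that the framework's primary value is identifying when not to intervene.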