Accurate Failure Prediction in Agents Does Not Imply Effective Failure Prevention

Abstract

Proactive interventions by LLM critic models are often assumed to improve reliability, yet their effects at deployment time are poorly understood. We show that a binary LLM critic with strong offline accuracy (AUROC 0.94) can nevertheless cause severe performance degradation: it induces a 26 percentage point (pp) collapse on one model while leaving another essentially unchanged (about 0 pp). This variability shows that critic accuracy alone is insufficient to determine whether intervention is safe. We identify a disruption-recovery tradeoff: interventions may recover failing trajectories, but they may also disrupt trajectories that would otherwise have succeeded. Based on this insight, we propose a pre-deployment test that uses a small pilot of 50 tasks to estimate whether intervention is likely to help or harm, without requiring full deployment. Across benchmarks, the test correctly anticipates outcomes: intervention degrades performance on high-success tasks (0 to -26 pp), while yielding a modest improvement on the high-failure ALFWorld benchmark (+2.8 pp, p=0.014). The primary value of our framework is therefore identifying when not to intervene, which prevents severe regressions before deployment.
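The disruption-recovery tradeoff can be illustrated with simple bookkeeping. The following is a minimal sketch, not the test proposed in the paper: it assumes each pilot task is run once without and once with intervention, and the function names and toy counts are hypothetical.

```python
from typing import List, Tuple


def estimate_from_pilot(pairs: List[Tuple[bool, bool]]) -> Tuple[float, float, float]:
    """Estimate (base_success, recovery_rate, disruption_rate) from a pilot.

    Each pair is (succeeded_without_intervention, succeeded_with_intervention)
    for one task. Running every task in both conditions is an assumption of
    this sketch.
    """
    if not pairs:
        raise ValueError("pilot must contain at least one task")
    with_when_ok = [w for without, w in pairs if without]
    with_when_failed = [w for without, w in pairs if not without]
    base_success = len(with_when_ok) / len(pairs)
    # Disruption: baseline successes that fail once intervention is on.
    disruption = (
        sum(1 for w in with_when_ok if not w) / len(with_when_ok)
        if with_when_ok else 0.0
    )
    # Recovery: baseline failures that succeed once intervention is on.
    recovery = (
        sum(1 for w in with_when_failed if w) / len(with_when_failed)
        if with_when_failed else 0.0
    )
    return base_success, recovery, disruption


def expected_net_change(base_success: float, recovery: float, disruption: float) -> float:
    """Expected change in success rate (as a fraction) from turning intervention on."""
    gained = (1.0 - base_success) * recovery   # failing runs that are recovered
    lost = base_success * disruption           # succeeding runs that are disrupted
    return gained - lost


# Toy 50-task pilot with made-up counts (not data from the paper).
toy_pilot = (
    [(True, True)] * 40      # succeeds either way
    + [(True, False)] * 5    # disrupted by intervention
    + [(False, True)] * 3    # recovered by intervention
    + [(False, False)] * 2   # fails either way
)
base, rec, dis = estimate_from_pilot(toy_pilot)
net = expected_net_change(base, rec, dis)
print(f"base={base:.2f} recovery={rec:.2f} disruption={dis:.2f} net={net * 100:+.1f} pp")
```

In this toy pilot the baseline success rate is high, so even a small disruption rate outweighs a fairly high recovery rate. That matches the qualitative pattern the abstract reports for high-success tasks.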
