Accurate Failure Prediction in Agents Does Not Imply Effective Failure Prevention

Abstract

Proactive interventions by LLM critic models are often assumed to improve reliability, yet their effects at deployment time are poorly understood. We show that a binary LLM critic with strong offline accuracy (AUROC 0.94) can nevertheless cause severe performance degradation, inducing a 26 percentage point (pp) collapse on one model while having a near-zero effect on another. This variability demonstrates that LLM critic accuracy alone is insufficient to determine whether intervention is safe. We identify a disruption-recovery tradeoff: interventions may recover failing trajectories but can also disrupt trajectories that would have succeeded. Based on this insight, we propose a pre-deployment test that uses a small pilot of 50 tasks to estimate whether intervention is likely to help or harm, without requiring full deployment. Across benchmarks, the test correctly anticipates outcomes: intervention degrades performance on high-success tasks (0 to -26 pp), while yielding a modest improvement on the high-failure ALFWorld benchmark (+2.8 pp, p=0.014). The primary value of our framework is therefore identifying when not to intervene, preventing severe regressions before deployment.
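The disruption-recovery tradeoff can be illustrated with a small piece of accounting. The sketch below is not the paper's test, whose details are not given in the abstract. It assumes a paired pilot in which each task is run once without and once with intervention, and it assumes the net effect can be summarized as (1 - p) * r - p * d, where p is the baseline success rate, r the fraction of failing runs that intervention recovers, and d the fraction of succeeding runs that it disrupts. All names (`PilotEstimate`, `estimate_from_pilot`, `expected_change`, `should_intervene`) and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PilotEstimate:
    baseline_success: float  # p: success rate without intervention
    recovery_rate: float     # r: share of baseline failures fixed by intervention
    disruption_rate: float   # d: share of baseline successes broken by intervention


def estimate_from_pilot(outcomes: list[tuple[bool, bool]]) -> PilotEstimate:
    """Estimate p, r, d from paired (baseline_ok, intervened_ok) pilot outcomes.

    Assumption: each pilot task was run both with and without intervention.
    """
    if not outcomes:
        raise ValueError("pilot must contain at least one task")
    after_success = [after for before, after in outcomes if before]
    after_failure = [after for before, after in outcomes if not before]
    p = len(after_success) / len(outcomes)
    # Guard against empty groups (for example, a pilot with no baseline failures).
    d = sum(not a for a in after_success) / len(after_success) if after_success else 0.0
    r = sum(a for a in after_failure) / len(after_failure) if after_failure else 0.0
    return PilotEstimate(p, r, d)


def expected_change(est: PilotEstimate) -> float:
    """Net change in success rate: recovered failures minus disrupted successes."""
    p, r, d = est.baseline_success, est.recovery_rate, est.disruption_rate
    return (1.0 - p) * r - p * d


def should_intervene(est: PilotEstimate, margin: float = 0.0) -> bool:
    """Recommend intervention only if the estimated net change exceeds the margin."""
    return expected_change(est) > margin


# Hypothetical 50-task pilot: 40 baseline successes (8 disrupted),
# 10 baseline failures (2 recovered).
pilot = (
    [(True, True)] * 32 + [(True, False)] * 8
    + [(False, True)] * 2 + [(False, False)] * 8
)
estimate = estimate_from_pilot(pilot)
print(estimate, expected_change(estimate), should_intervene(estimate))
```

Under this accounting, intervention helps only when r / d exceeds p / (1 - p), so a high baseline success rate demands a critic that recovers far more often than it disrupts. That is consistent with the pattern reported above, where high-success tasks regress and the high-failure ALFWorld benchmark improves. Because agent runs are stochastic and a 50-task pilot is small, estimates of this kind are noisy and would need a confidence interval or a nonzero `margin` in practice.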

