HiL-Bench (Human-in-the-Loop Benchmark): Do Agents Know When to Ask for Help?

Abstract

Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability but judgment: knowing when to act autonomously and when to ask for help. Current benchmarks are blind to this failure mode. They supply unambiguous, detailed instructions and reward only execution correctness, so an agent that makes a lucky guess for a missing requirement scores identically to one that would have asked to be certain. We present HiL-Bench (Human-in-the-Loop Benchmark) to measure this selective escalation skill. Each task contains human-validated blockers (missing information, ambiguous requests, contradictory information) that surface only through progressive exploration, not upfront inspection. Our core metric, Ask-F1, the harmonic mean of question precision and blocker recall, captures the tension between over-asking and silent guessing; its structure architecturally prevents gaming through question spam. Evaluation across SWE and text-to-SQL domains reveals a large, universal judgment gap: no frontier model recovers more than a fraction of its full-information performance when deciding whether to ask. Failure analysis identifies three key help-seeking patterns: overconfident wrong beliefs with no gap detection; high uncertainty detection yet persistent errors; and broad, imprecise escalation without self-correction. These consistent patterns confirm that poor help-seeking is a model-level flaw, not a task-specific one. RL training on a shaped Ask-F1 reward shows that judgment is trainable: a 32B model improves both help-seeking quality and task pass rate, with gains that transfer across domains. The model does not learn domain-specific heuristics for when to ask; it learns to detect unresolvable uncertainty and act on it.
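
As a rough illustration of why a harmonic mean of question precision and blocker recall penalizes both question spam and silent guessing, the sketch below computes an Ask-F1-style score from simple counts. The function name, its parameters, and the count-based definitions of precision and recall are assumptions made for this example; the abstract does not specify how questions are matched to blockers, and this is not the shaped reward used for RL training.

```python
def ask_f1(num_questions: int, num_relevant_questions: int,
           num_blockers: int, num_blockers_resolved: int) -> float:
    """Harmonic mean of question precision and blocker recall.

    Assumed definitions (hypothetical, for illustration only):
      precision = relevant questions / all questions asked
      recall    = blockers resolved by a question / all blockers in the task
    """
    # An agent that asks nothing has no precision to speak of; treat it as 0.
    precision = num_relevant_questions / num_questions if num_questions else 0.0
    recall = num_blockers_resolved / num_blockers if num_blockers else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Three blockers in the task, three on-target questions: a perfect score.
targeted = ask_f1(3, 3, 3, 3)
# Thirty questions to hit the same three blockers: recall is 1.0, precision 0.1.
spam = ask_f1(30, 3, 3, 3)
# No questions at all (silent guessing): recall is 0, so the score is 0.
silent = ask_f1(0, 0, 3, 0)
print(targeted, round(spam, 3), silent)
```

Under these assumed definitions, the spamming agent reaches full recall but is pulled down to roughly 0.18 by its low precision, while the silent agent scores 0 regardless of whether its guesses happen to be right.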
