Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Abstract

Large language models are increasingly deployed as autonomous agents that execute multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow coverage of modalities and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite that addresses all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, and reports Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of the safety violations and 13% of the robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping by up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing worse on video than on documents or images, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.
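To make the distinction between the three reported metrics concrete, the following is a minimal sketch rather than the benchmark's actual scoring code. It assumes that Pass@k counts a task as solved when at least one of its k trials passes, that Pass^k requires every one of the k trials to pass, and that a trial "passes" when its score reaches a cutoff. The function name `summarize_trials`, the parameter `pass_threshold`, and the 0.5 default are hypothetical and do not come from the paper.

```python
from statistics import mean
from typing import Dict, List


def summarize_trials(
    trial_scores: Dict[str, List[float]],
    pass_threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> Dict[str, float]:
    """Aggregate per-task trial scores into Average Score, Pass@k, and Pass^k.

    trial_scores maps a task id to the scores of its k repeated trials.
    """
    if not trial_scores or any(len(s) == 0 for s in trial_scores.values()):
        raise ValueError("every task needs at least one trial score")

    # Convert each trial score into a pass/fail flag (assumed definition).
    passed = {
        task: [score >= pass_threshold for score in scores]
        for task, scores in trial_scores.items()
    }
    n_tasks = len(passed)

    return {
        # Mean over every trial of every task.
        "average_score": mean(
            score for scores in trial_scores.values() for score in scores
        ),
        # Fraction of tasks solved in at least one trial (peak capability).
        "pass_at_k": sum(any(flags) for flags in passed.values()) / n_tasks,
        # Fraction of tasks solved in every trial (consistency).
        "pass_hat_k": sum(all(flags) for flags in passed.values()) / n_tasks,
    }
```

Under these assumptions, a task with trial scores `[1.0, 0.0, 0.0]` still counts toward `pass_at_k` but not toward `pass_hat_k`. This is the pattern the abstract describes under error injection: occasional failures leave Pass@3 largely unchanged while pulling Pass^3 down.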

