WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Abstract

Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are actually deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and more than 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best performer, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%; switching the harness alone shifts a single model's score by up to 18 points. These results show that long-horizon, native-runtime agent tasks remain far from solved for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation.