WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Abstract

Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, so they do not establish whether agents can complete realistic long-horizon work in the runtimes where they are actually deployed. We present WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and more than 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid: it combines deterministic rule-based checks, auditing of side effects on the environment state, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best one, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%. Switching the harness alone shifts a single model's score by up to 18 percentage points. These results show that long-horizon, native-runtime agent tasks remain far from solved for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation.
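
To make the hybrid grading idea concrete, the following is a minimal Python sketch of how three kinds of checks could be combined into one task score. It is an illustration only, not the WildClawBench grader: the `Check` class, the `grade` function, the file names `report.txt` and `input.csv`, the equal weights, and the weighted-average aggregation are all assumptions introduced here, and the judge is a fixed-score stub standing in for an LLM/VLM call.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class Check:
    """One grading check (hypothetical structure, not from the benchmark)."""
    name: str
    kind: str                      # "rule", "state", or "judge"
    run: Callable[[Path], float]   # returns a score in [0, 1]
    weight: float = 1.0            # equal weights are an assumption


def grade(workdir: Path, checks: List[Check]) -> float:
    """Weighted average of check scores, each clamped to [0, 1]."""
    total_weight = sum(c.weight for c in checks)
    if total_weight <= 0:
        raise ValueError("at least one check with positive weight is required")
    weighted = sum(c.weight * min(1.0, max(0.0, c.run(workdir))) for c in checks)
    return weighted / total_weight


def rule_report_mentions_total(workdir: Path) -> float:
    # Deterministic rule check: the output file exists and contains a keyword.
    report = workdir / "report.txt"
    if not report.is_file():
        return 0.0
    return 1.0 if "total" in report.read_text(encoding="utf-8").lower() else 0.0


def state_input_untouched(workdir: Path) -> float:
    # Environment-state audit: the agent must not have deleted its input file.
    return 1.0 if (workdir / "input.csv").is_file() else 0.0


def judge_stub(workdir: Path) -> float:
    # Placeholder for an LLM/VLM judge; a real judge would score the report's
    # content semantically. Here it returns a fixed value.
    return 0.5


def demo() -> float:
    # Simulate the workspace an agent might leave behind, then grade it.
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "report.txt").write_text("Total: 42\n", encoding="utf-8")
        (workdir / "input.csv").write_text("a,b\n1,2\n", encoding="utf-8")
        checks = [
            Check("report_mentions_total", "rule", rule_report_mentions_total),
            Check("input_untouched", "state", state_input_untouched),
            Check("semantic_quality", "judge", judge_stub),
        ]
        return grade(workdir, checks)


if __name__ == "__main__":
    print(f"task score: {demo():.3f}")
```

Keeping each check as a plain function of the final workspace means the rule-based and state-audit checks stay deterministic and reproducible, while only the judge component depends on a model call.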