WildClawBench: A Benchmark for Evaluating Agents on Real-World Long-Horizon Tasks

Abstract

Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best performer, Claude Opus 4.7, reaches only 62.2% overall accuracy under OpenClaw, while every other model stays below 60%. Switching the harness alone shifts a single model's score by up to 18 percentage points. These results show that long-horizon, native-runtime agent evaluation remains far from solved for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation.
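
To make the hybrid grading idea concrete, the following is a minimal sketch of how the three check types named above (rule-based checks, environment-state audits, and a judge) might be combined into one task score. It is an illustration under stated assumptions, not the benchmark's actual grader: the `Check` class, the `grade` function, the weights, the weighted-average aggregation, and the dictionary standing in for a container filesystem are all hypothetical, and the judge is a stub that returns a fixed score instead of calling an LLM/VLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for the container state after the agent finishes:
# a mapping from file path to file contents.
Workspace = Dict[str, str]


@dataclass
class Check:
    name: str
    kind: str                          # "rule", "state", or "judge"
    weight: float                      # assumed weighting, not from the paper
    run: Callable[[Workspace], float]  # returns a score in [0, 1]


def grade(checks: List[Check], workspace: Workspace) -> float:
    """Weighted average of check scores, each clamped to [0, 1]."""
    total = sum(c.weight for c in checks)
    if total <= 0:
        raise ValueError("checks must have positive total weight")
    earned = sum(
        c.weight * min(1.0, max(0.0, c.run(workspace))) for c in checks
    )
    return earned / total


# Deterministic rule: the expected output file exists and has a given heading.
def report_has_summary(ws: Workspace) -> float:
    return 1.0 if "Summary" in ws.get("out/report.md", "") else 0.0


# State audit: a file the agent should not touch is still unchanged.
def config_untouched(ws: Workspace) -> float:
    return 1.0 if ws.get("etc/config.yaml") == "mode: prod\n" else 0.0


# Judge stub: a real grader would query an LLM/VLM here (assumption).
def stub_judge(ws: Workspace) -> float:
    return 0.5


checks = [
    Check("report-format", "rule", 1.0, report_has_summary),
    Check("no-side-effects", "state", 1.0, config_untouched),
    Check("semantic-quality", "judge", 2.0, stub_judge),
]

workspace = {
    "out/report.md": "# Summary\nAll steps completed.\n",
    "etc/config.yaml": "mode: prod\n",
}

score = grade(checks, workspace)
print(f"{score:.2f}")  # prints 0.75
```

In this sketch the rule and state checks pass and the stubbed judge awards half credit, so the weighted average is (1 + 1 + 2 × 0.5) / 4 = 0.75. Keeping the state audit as a separate check lets a task penalize unwanted side effects even when the final answer looks correct.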