WildClawBench: A Benchmark for Long-Horizon Agent Evaluation in the Real World

Abstract

Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are actually deployed. We present WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and more than 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best performer, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching the harness alone shifts a single model's score by up to 18 points. These results show that long-horizon, native-runtime agent tasks remain far from solved for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation.
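
To make the hybrid grading idea concrete, the following is a minimal Python sketch of how the three kinds of evidence named above (a deterministic rule check, an environment-state audit, and a semantic judge) might be combined for one task. It is an illustration under stated assumptions, not the benchmark's actual grader: the names `CheckResult`, `rule_check`, `state_audit`, `semantic_check`, and `grade` are hypothetical, the judge is a stand-in callable supplied by the caller rather than a real LLM/VLM call, and the equal-weight "fraction of checks passed" aggregation is assumed, since the abstract does not specify how the signals are weighted.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class CheckResult:
    """Outcome of one grading check (hypothetical structure)."""
    name: str
    passed: bool


def rule_check(output_text: str, required_substring: str) -> CheckResult:
    # Deterministic rule-based check: the final output must mention a required string.
    return CheckResult("rule", required_substring in output_text)


def state_audit(workdir: Path, expected_file: str) -> CheckResult:
    # Environment-state audit: the expected side effect must exist on disk.
    return CheckResult("state", (workdir / expected_file).is_file())


def semantic_check(output_text: str, judge: Callable[[str], bool]) -> CheckResult:
    # Semantic verification: delegate to a judge callable.
    # In practice this would wrap an LLM/VLM; here it is any function str -> bool.
    return CheckResult("judge", bool(judge(output_text)))


def grade(results: list[CheckResult]) -> float:
    # Assumed aggregation: equal-weight fraction of checks passed.
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def demo() -> float:
    # Simulate an agent run that wrote a report file into its working directory.
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "report.md").write_text("summary: done")
        output = "Wrote report.md with the summary."
        results = [
            rule_check(output, "report.md"),
            state_audit(workdir, "report.md"),
            semantic_check(output, judge=lambda text: "summary" in text),
        ]
        return grade(results)


if __name__ == "__main__":
    print(demo())
```

Keeping the judge as an injected callable means the deterministic checks can be tested without any model access, and a real judge can be swapped in later without touching the aggregation step.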