Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

Abstract

LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, which makes it difficult to evaluate agents against evolving workflow demand or to verify whether a task was actually executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot. Each release is constructed from public workflow-demand signals (the current release uses ClawHub Top-500 skills) and is materialized as controlled tasks with fixed fixtures, services, workspaces, and graders. For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts; it uses deterministic checks when the evidence is sufficient and structured LLM judging only for semantic dimensions. The current release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule. Experiments show that reliable workflow automation remains far from solved: the leading model passes only 66.7% of tasks, and no model reaches 70%. Failures are structured by task family and execution surface: HR, management, and multi-system business workflows are persistent bottlenecks, while local workspace repair is comparatively easier but not saturated. Leaderboard rank alone is insufficient, because models with similar pass rates can diverge in overall completion and task-level discrimination concentrates in a middle band of task difficulty. Claw-Eval-Live suggests that workflow-agent evaluation should be grounded twice: in fresh external demand and in verifiable agent action.
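The distinction between pass rate and overall completion can be made concrete with a small sketch. The abstract does not spell out the pass rule or how completion is scored, so everything below is an illustrative assumption: the `PASS_THRESHOLD` value, the `TaskResult` type, the per-task completion scores, and the two toy models are hypothetical and are not taken from the benchmark.

```python
import math
from dataclasses import dataclass

# Hypothetical threshold: the abstract mentions a shared public pass rule
# but does not state its value.
PASS_THRESHOLD = 0.8


@dataclass
class TaskResult:
    task_id: str
    completion: float  # assumed: fraction of grading checks satisfied, in [0, 1]


def pass_rate(results, threshold=PASS_THRESHOLD):
    """Fraction of tasks whose completion meets the (assumed) pass threshold."""
    return sum(r.completion >= threshold for r in results) / len(results)


def mean_completion(results):
    """Average completion over all tasks, counting partial progress."""
    return sum(r.completion for r in results) / len(results)


# Two made-up models that pass the same two of three tasks.
model_a = [TaskResult("t1", 1.0), TaskResult("t2", 0.9), TaskResult("t3", 0.7)]
model_b = [TaskResult("t1", 0.8), TaskResult("t2", 0.8), TaskResult("t3", 0.1)]

same_pass_rate = pass_rate(model_a) == pass_rate(model_b)
completion_gap = mean_completion(model_a) - mean_completion(model_b)

# Arithmetic on the reported figures: 66.7% of 105 tasks is consistent with
# 70 passed tasks, and reaching 70% would require at least 74.
best_pass_pct = round(70 / 105 * 100, 1)
tasks_needed_for_70_pct = math.ceil(0.70 * 105)
```

In this toy setup both models have the same pass rate, yet the first completes far more of the work on average, which is the kind of divergence a leaderboard rank alone would hide.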
