Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
April 30, 2026
Authors: Chenxin Li, Zhengyang Tang, Huangxin Lin, Yunlong Lin, Shijue Huang, Shengyuan Liu, Bowen Ye, Rang Li, Lei Li, Benyou Wang, Yixuan Yuan
cs.AI
Abstract
LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or to verify whether a task was actually executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, which is updated across releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot. Each release is constructed from public workflow-demand signals, with the ClawHub Top-500 skills used in the current release, and is materialized as controlled tasks with fixed fixtures, services, workspaces, and graders. For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts, applying deterministic checks when the evidence is sufficient and structured LLM judging only for semantic dimensions. The current release contains 105 tasks spanning controlled business services and local workspace repair, and it evaluates 13 frontier models under a shared public pass rule. Experiments show that reliable workflow automation remains far from solved: the leading model passes only 66.7% of tasks, and no model reaches a 70% pass rate. Failure modes are structured by task family and execution surface, with HR, management, and multi-system business workflows as persistent bottlenecks, while local workspace repair is comparatively easier but not yet saturated. Leaderboard rank alone is insufficient, because models with similar pass rates can diverge in overall completion, and task-level discrimination concentrates in a middle band of medium-difficulty tasks. Claw-Eval-Live suggests that workflow-agent evaluation should be doubly grounded: in fresh external demand and in verifiable agent action.
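The grading protocol summarized above, deterministic checks over recorded evidence with structured LLM judging reserved for semantic dimensions, can be pictured with a minimal sketch. All names below (Evidence, Task, grade, llm_judge, the example fields and task id) are illustrative assumptions for exposition, not Claw-Eval-Live's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative evidence bundle: the abstract lists execution traces, audit
# logs, service state, and post-run workspace artifacts as recorded evidence.
@dataclass
class Evidence:
    trace: list[str]
    audit_log: list[str]
    service_state: dict[str, object]
    workspace: dict[str, str]  # hypothetical: file path -> post-run contents

@dataclass
class Task:
    task_id: str
    # Deterministic checks: each inspects the evidence and returns pass/fail.
    checks: list[Callable[[Evidence], bool]] = field(default_factory=list)
    # Dimensions that deterministic evidence cannot settle (e.g. text quality).
    semantic_dims: list[str] = field(default_factory=list)

def llm_judge(dimension: str, evidence: Evidence) -> bool:
    """Placeholder for a structured LLM judge; a real grader would send the
    evidence plus a rubric for `dimension` to a judge model."""
    raise NotImplementedError("wire up a judge model here")

def grade(task: Task, evidence: Evidence, judge=llm_judge) -> bool:
    """Deterministic checks first; LLM judging only for semantic dimensions."""
    if not all(check(evidence) for check in task.checks):
        return False  # a verifiable action failed; no judge call is needed
    return all(judge(dim, evidence) for dim in task.semantic_dims)

# Hypothetical usage: a check that a record exists in the controlled service,
# with one semantic dimension left to the judge.
task = Task(
    task_id="hr-onboarding-001",
    checks=[lambda ev: ev.service_state.get("employees_created") == 1],
    semantic_dims=["summary_quality"],
)
```

Under this sketch, a shared pass rule could simply be `grade(...) == True` for every model, which matches the abstract's claim that all 13 models are scored under one public criterion; the paper itself defines the actual rule.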