Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
April 30, 2026
Authors: Chenxin Li, Zhengyang Tang, Huangxin Lin, Yunlong Lin, Shijue Huang, Shengyuan Liu, Bowen Ye, Rang Li, Lei Li, Benyou Wang, Yixuan Yuan
cs.AI
Abstract
LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or to verify whether a task was actually executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated between releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot. Each release is constructed from these signals, with the ClawHub Top-500 skills in the current release, and materialized as controlled tasks with fixed fixtures, services, workspaces, and graders. For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts, applying deterministic checks when the evidence is sufficient and structured LLM judging only for semantic dimensions. The current release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule. Experiments reveal that reliable workflow automation remains far from solved: the leading model passes only 66.7% of tasks, and no model reaches a 70% pass rate. Failures are structured by task family and execution surface: HR, management, and multi-system business workflows are persistent bottlenecks, while local workspace repair is comparatively easy but unsaturated. Leaderboard rank alone is insufficient, because models with similar pass rates can diverge in overall completion, and task-level discrimination concentrates in a middle band of tasks. Claw-Eval-Live suggests that workflow-agent evaluation should be grounded twice: in fresh external demand and in verifiable agent action.
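The two-stage grading just described (deterministic checks over recorded evidence first, with a structured LLM judge reserved for semantic dimensions) can be pictured as follows. This is a minimal Python sketch under assumed names: RunEvidence, Task, grade, and the judge interface are hypothetical illustrations, not the published Claw-Eval-Live API.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical evidence bundle recorded after an agent run. The abstract
# names these four sources; the concrete schema here is an assumption.
@dataclass
class RunEvidence:
    execution_trace: list[dict]             # tool/action calls in order
    audit_log: list[dict]                   # service-side audit entries
    service_state: dict                     # post-run state of business services
    workspace_artifacts: dict[str, bytes]   # files left in the workspace

@dataclass
class Task:
    task_id: str
    # Deterministic predicates over recorded evidence, e.g. "the expected
    # record exists in service_state" or "the repaired file passes tests".
    deterministic_checks: list[Callable[[RunEvidence], bool]]
    # Dimensions without machine-checkable evidence; these fall back to a
    # structured LLM judge assumed here to return a score in [0, 1].
    semantic_dimensions: list[str] = field(default_factory=list)

def grade(task: Task,
          evidence: RunEvidence,
          llm_judge: Callable[[str, RunEvidence], float],
          judge_threshold: float = 0.5) -> bool:
    """Shared pass rule (sketch): every deterministic check must hold,
    and every semantic dimension must clear the judge threshold."""
    if not all(check(evidence) for check in task.deterministic_checks):
        return False  # hard evidence is authoritative; no judge appeal
    return all(llm_judge(dim, evidence) >= judge_threshold
               for dim in task.semantic_dimensions)
```

One design point worth noting: treating hard evidence as authoritative keeps the pass rule auditable, since the judge can score semantic dimensions but can never overturn a failed deterministic check.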