Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
April 7, 2026
Authors: Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, Qi Liu, Zhifang Sui, Tong Yang
cs.AI
Abstract
Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, reporting Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing worse on video than on documents or images, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.
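The abstract's distinction between Pass@k (peak capability) and Pass^k (consistency) can be made concrete with a minimal sketch. Assuming the common reading of these metrics over k = 3 independent trials per task (the paper's exact aggregation may differ): Pass@k counts a task as solved if any trial succeeds, while Pass^k requires all trials to succeed, so an agent with lucky, inconsistent runs scores high on Pass@3 but low on Pass^3.

```python
# Hedged sketch of Pass@k vs Pass^k over per-task trial outcomes.
# Assumption (not confirmed by the paper): Pass@k = any trial passes,
# Pass^k = all k trials pass, averaged over tasks.

def pass_at_k(trials: list[bool]) -> bool:
    """Pass@k: task counts as solved if ANY of the k trials succeeds."""
    return any(trials)

def pass_hat_k(trials: list[bool]) -> bool:
    """Pass^k: task counts as solved only if ALL k trials succeed."""
    return all(trials)

# Illustrative outcomes for three tasks, three trials each.
tasks = [
    [True, True, True],     # consistently solved
    [True, False, True],    # inconsistent: flips Pass@3 but not Pass^3
    [False, False, False],  # never solved
]

pass_at_3 = sum(pass_at_k(t) for t in tasks) / len(tasks)   # 2/3
pass_hat_3 = sum(pass_hat_k(t) for t in tasks) / len(tasks)  # 1/3
```

This mirrors the reported finding: error injection that makes runs flaky moves tasks from the first row to the second, lowering Pass^3 while leaving Pass@3 unchanged.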