Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
April 7, 2026
Authors: Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, Qi Liu, Zhifang Sui, Tong Yang
cs.AI
Abstract
Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, reporting Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing worse on video than on documents or images, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.
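The abstract's distinction between Pass@k (peak capability) and Pass^k (consistency) can be made concrete with a minimal sketch. Assuming the common reading of these metrics over k = 3 independent trials per task (the paper's exact aggregation may differ): Pass@k counts a task as solved if any trial succeeds, while Pass^k requires all trials to succeed, so an agent with lucky, inconsistent runs scores high on Pass@3 but low on Pass^3.

```python
# Hedged sketch of Pass@k vs Pass^k over per-task trial outcomes.
# Assumption (not confirmed by the paper): Pass@k = any trial passes,
# Pass^k = all k trials pass, averaged over tasks.

def pass_at_k(trials: list[bool]) -> bool:
    """Pass@k: task counts as solved if ANY of the k trials succeeds."""
    return any(trials)

def pass_hat_k(trials: list[bool]) -> bool:
    """Pass^k: task counts as solved only if ALL k trials succeed."""
    return all(trials)

# Illustrative outcomes for three tasks, three trials each.
tasks = [
    [True, True, True],     # consistently solved
    [True, False, True],    # inconsistent: flips Pass@3 but not Pass^3
    [False, False, False],  # never solved
]

pass_at_3 = sum(pass_at_k(t) for t in tasks) / len(tasks)   # 2/3
pass_hat_3 = sum(pass_hat_k(t) for t in tasks) / len(tasks)  # 1/3
```

This mirrors the reported finding: error injection that makes runs flaky moves tasks from the first row to the second, lowering Pass^3 while leaving Pass@3 unchanged.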