Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
April 7, 2026
Authors: Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, Qi Liu, Zhifang Sui, Tong Yang
cs.AI
Abstract
Large language models are increasingly deployed as autonomous agents that execute multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite that addresses all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups: general service orchestration, multimodal perception and generation, and multi-turn professional dialogue. Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, reporting Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of the safety violations and 13% of the robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping by up to 24% while Pass@3 remains stable; and (3) multimodal performance varies sharply: most models perform worse on video than on documents or images, and no single model dominates across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but also reliably deployable.
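To make the reporting protocol concrete, below is a minimal sketch of the trial-level metrics, assuming the standard definitions: Pass@k counts a task as solved if at least one of its k trials passes (peak capability), while Pass^k requires all k trials to pass (consistency). The task IDs, outcomes, and the `aggregate` helper are illustrative assumptions, not taken from the paper.

```python
from statistics import mean

def pass_at_k(trials: list[bool]) -> bool:
    # Pass@k: solved if ANY of the k trials passes (peak capability).
    return any(trials)

def pass_hat_k(trials: list[bool]) -> bool:
    # Pass^k: solved only if ALL k trials pass (consistency).
    return all(trials)

def aggregate(per_task_trials: dict[str, list[bool]]) -> dict[str, float]:
    # Fraction of tasks solved under each criterion, plus the mean
    # per-trial pass rate (a stand-in for the paper's Average Score).
    return {
        "avg_score": mean(mean(t) for t in per_task_trials.values()),
        "pass@3": mean(pass_at_k(t) for t in per_task_trials.values()),
        "pass^3": mean(pass_hat_k(t) for t in per_task_trials.values()),
    }

# Hypothetical outcomes for two tasks, three trials each.
trials = {
    "task_001": [True, True, False],  # counts toward Pass@3, not Pass^3
    "task_002": [True, True, True],   # counts toward both
}
print(aggregate(trials))
# {'avg_score': 0.833..., 'pass@3': 1.0, 'pass^3': 0.5}
```

The gap on task_001 mirrors the paper's second finding: an intermittent failure leaves Pass@3 untouched while pulling Pass^3 down, which is why the two metrics together separate peak capability from deployable consistency.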