Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
April 7, 2026
Authors: Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, Qi Liu, Zhifang Sui, Tong Yang
cs.AI
Abstract
Large language models are increasingly deployed as autonomous agents that execute multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite that addresses all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups: general service orchestration, multimodal perception and generation, and multi-turn professional dialogue. Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, reporting Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of the safety violations and 13% of the robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping by up to 24% while Pass@3 remains stable; and (3) multimodal performance varies sharply: most models perform worse on video than on documents or images, and no single model dominates across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but also reliably deployable.
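To make the reporting protocol concrete, below is a minimal sketch of the trial-level metrics, assuming the standard definitions: Pass@k counts a task as solved if at least one of its k trials passes (peak capability), while Pass^k requires all k trials to pass (consistency). The task IDs, outcomes, and the `aggregate` helper are illustrative assumptions, not taken from the paper.

```python
from statistics import mean

def pass_at_k(trials: list[bool]) -> bool:
    # Pass@k: solved if ANY of the k trials passes (peak capability).
    return any(trials)

def pass_hat_k(trials: list[bool]) -> bool:
    # Pass^k: solved only if ALL k trials pass (consistency).
    return all(trials)

def aggregate(per_task_trials: dict[str, list[bool]]) -> dict[str, float]:
    # Fraction of tasks solved under each criterion, plus the mean
    # per-trial pass rate (a stand-in for the paper's Average Score).
    return {
        "avg_score": mean(mean(t) for t in per_task_trials.values()),
        "pass@3": mean(pass_at_k(t) for t in per_task_trials.values()),
        "pass^3": mean(pass_hat_k(t) for t in per_task_trials.values()),
    }

# Hypothetical outcomes for two tasks, three trials each.
trials = {
    "task_001": [True, True, False],  # counts toward Pass@3, not Pass^3
    "task_002": [True, True, True],   # counts toward both
}
print(aggregate(trials))
# {'avg_score': 0.833..., 'pass@3': 1.0, 'pass^3': 0.5}
```

The gap on task_001 mirrors the paper's second finding: an intermittent failure leaves Pass@3 untouched while pulling Pass^3 down, which is why the two metrics together separate peak capability from deployable consistency.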