Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

June 10, 2026
Authors: Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu, Yuchuan Tian, Wei He, Hang Zhou, Jianyuan Guo, Hailin Hu, Lin Ma, Chao Xu, Guohao Dai, Lixue Xia, Yunchao Wei, Yunhe Wang, Yu Wang
cs.AI

Abstract

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or "claws", comparable under fair settings, including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. For faster validation, we also release Claw-SWE-Bench Lite, an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only 19.1% Pass@1, whereas the full adapter reaches 73.4% with the same GLM 5.1 backbone, showing that adapter design is essential for OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw × nine-model sweep and a five-claw × two-model sweep, model choice changes Pass@1 by 29.4 percentage points (pp), and harness choice under a fixed model changes it by 27.4 pp; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at https://github.com/opensquilla/claw-swe-bench and https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench.
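
To make the two reporting axes concrete, the following is a minimal Python sketch of how Pass@1, a percentage-point gap, and total API cost could be tabulated per system. It assumes Pass@1 means the share of instances resolved on a single attempt. The `RunResult` schema, the function names, the system labels, and all numbers in `toy_runs` are hypothetical illustrations, not the benchmark's actual data format or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One attempt by one system on one instance (hypothetical schema)."""
    system: str          # placeholder label for a harness + model pair
    instance_id: str     # placeholder instance identifier
    resolved: bool       # True if the evaluator accepted the extracted patch
    api_cost_usd: float  # API spend for this single attempt


def summarize(results):
    """Return {system: (pass_at_1_percent, total_cost_usd)}.

    Assumes exactly one attempt per (system, instance) pair.
    """
    tallies = {}
    for r in results:
        solved, total, cost = tallies.get(r.system, (0, 0, 0.0))
        tallies[r.system] = (solved + int(r.resolved), total + 1, cost + r.api_cost_usd)
    return {
        system: (100.0 * solved / total, cost)
        for system, (solved, total, cost) in tallies.items()
    }


def pp_gap(summary):
    """Largest minus smallest Pass@1 across systems, in percentage points."""
    rates = [rate for rate, _ in summary.values()]
    return max(rates) - min(rates)


# Made-up toy data: A and B resolve the same share of instances at
# different cost, while C resolves fewer.
toy_runs = [
    RunResult("A", "i1", True, 0.5), RunResult("A", "i2", False, 0.5),
    RunResult("A", "i3", True, 0.5), RunResult("A", "i4", True, 0.5),
    RunResult("B", "i1", True, 2.0), RunResult("B", "i2", True, 2.0),
    RunResult("B", "i3", False, 2.0), RunResult("B", "i4", True, 2.0),
    RunResult("C", "i1", False, 0.5), RunResult("C", "i2", False, 0.5),
    RunResult("C", "i3", True, 0.5), RunResult("C", "i4", False, 0.5),
]

summary = summarize(toy_runs)
for system, (rate, cost) in sorted(summary.items()):
    print(f"{system}: Pass@1 = {rate:.1f}%, total API cost = ${cost:.2f}")
print(f"Pass@1 gap across systems: {pp_gap(summary):.1f} pp")
```

In this toy table, systems A and B have identical Pass@1 but a fourfold difference in total cost, which is the kind of case that a ranking by accuracy alone would hide.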