
Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

June 10, 2026
Authors: Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu, Yuchuan Tian, Wei He, Hang Zhou, Jianyuan Guo, Hailin Hu, Lin Ma, Chao Xu, Guohao Dai, Lixue Xia, Yunchao Wei, Yunhe Wang, Yu Wang
cs.AI

Abstract

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure with SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or "claws", comparable under fair settings: a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite for faster validation, an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only 19.1% Pass@1, whereas the full adapter reaches 73.4% with the same GLM 5.1 backbone, showing that adapter design is essential for OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw × nine-model sweep and a five-claw × two-model sweep, model choice changes Pass@1 by 29.4 pp, and harness choice changes it by 27.4 pp with the model held fixed; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at https://github.com/opensquilla/claw-swe-bench and https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench.
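
To make the abstract's quantities concrete, here is a minimal Python sketch of how Pass@1, total API cost, and a percentage-point spread might be computed from per-instance results. The `RunResult` record, its field names, and the helper functions are hypothetical illustrations and are not taken from the Claw-SWE-Bench code; only the 19.1% and 73.4% figures come from the abstract.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One attempt on one benchmark instance (hypothetical record layout)."""
    instance_id: str      # assumed identifier, not the benchmark's real schema
    resolved: bool        # True if the evaluator accepted the extracted patch
    api_cost_usd: float   # assumed per-instance API spend


def pass_at_1(results: list[RunResult]) -> float:
    """Percentage of instances resolved with a single attempt each."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(r.resolved for r in results) / len(results)


def total_cost(results: list[RunResult]) -> float:
    """Total API cost across all instances, in the same currency unit."""
    return sum(r.api_cost_usd for r in results)


def spread_pp(scores: dict[str, float]) -> float:
    """Gap in percentage points between the best and worst score."""
    return max(scores.values()) - min(scores.values())


if __name__ == "__main__":
    # Adapter gap reported in the abstract: minimal direct-diff vs. full adapter.
    adapters = {"minimal_direct_diff": 19.1, "full_adapter": 73.4}
    print(f"adapter gap: {spread_pp(adapters):.1f} pp")
```

Reporting `total_cost` next to `pass_at_1` reflects the abstract's point that two systems with similar accuracy can still differ substantially in API cost.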