Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-Style Agent Harnesses on Coding Tasks

Abstract

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure with SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract that scoring requires. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or "claws", comparable under fair settings, including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. For faster validation we also release Claw-SWE-Bench Lite, an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only 19.1% Pass@1, whereas the full adapter reaches 73.4% with the same GLM 5.1 backbone, showing that adapter design is essential for OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw × nine-model sweep and a five-claw × two-model sweep, model choice changes Pass@1 by 29.4 pp, and harness choice changes it by 27.4 pp with the model held fixed; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at https://github.com/opensquilla/claw-swe-bench and https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench.
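To make the patch and prediction contract concrete, the following is a minimal sketch of what an adapter has to hand to a SWE-bench-style evaluator. It is not the Claw-SWE-Bench adapter itself. The record keys (`instance_id`, `model_name_or_path`, `model_patch`) follow the public SWE-bench predictions format; the function names, the choice to stage all files before diffing, and the example identifiers are assumptions made for illustration.

```python
import json
import subprocess
from pathlib import Path


def extract_patch(workspace: Path, base_commit: str) -> str:
    """Return a unified diff of the workspace against base_commit.

    Assumption: staging everything first is one way to make newly
    created files show up in the diff; the real adapter may differ.
    """
    subprocess.run(["git", "add", "-A"], cwd=workspace, check=True)
    result = subprocess.run(
        ["git", "diff", "--cached", base_commit],
        cwd=workspace, check=True, capture_output=True, text=True,
    )
    return result.stdout


def build_prediction(instance_id: str, model_name: str, patch: str) -> dict:
    """Build one prediction record in the SWE-bench predictions format."""
    return {
        "instance_id": instance_id,
        "model_name_or_path": model_name,
        "model_patch": patch,
    }


def to_jsonl(predictions: list[dict]) -> str:
    """Serialize predictions as JSON Lines, one record per line."""
    return "\n".join(json.dumps(p) for p in predictions) + "\n"


def pass_at_1(resolved: dict[str, bool]) -> float:
    """Percentage of instances resolved by a single attempt each."""
    if not resolved:
        raise ValueError("no evaluated instances")
    return 100.0 * sum(resolved.values()) / len(resolved)
```

In use, a harness would run the agent inside the clean workspace, call `extract_patch` once the agent stops, and write the resulting record with `to_jsonl`. Given per-instance resolved flags from the evaluator, `pass_at_1` produces the percentage form of the metric quoted above.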