Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-Style Agent Harnesses on Coding Tasks

Abstract

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or "claws", comparable under fair settings: a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. For faster validation we also release Claw-SWE-Bench Lite, an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only 19.1% Pass@1, whereas the full adapter reaches 73.4% with the same GLM 5.1 backbone, showing that adapter design is essential for OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw × nine-model sweep and a five-claw × two-model sweep, model choice changes Pass@1 by 29.4 percentage points, and harness choice under a fixed model changes it by 27.4 percentage points; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats the harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, and provides both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at https://github.com/opensquilla/claw-swe-bench and https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench.
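
The quantities reported above (Pass@1, the spread in percentage points across systems, and total API cost) can be illustrated with a minimal Python sketch. It is not the benchmark's released tooling: the `RunResult` schema, the function names, the system labels `claw-a` and `claw-b`, and every number in the toy data are assumptions made here for illustration only, and none of them are results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One single-attempt run of a system on one instance (hypothetical schema)."""
    system: str          # label for a harness/model pairing (illustrative)
    instance_id: str
    resolved: bool       # True if the evaluator accepted the extracted patch
    api_cost_usd: float  # API spend attributed to this run


def pass_at_1(results: list[RunResult]) -> dict[str, float]:
    """Percentage of instances resolved per system, with one attempt per instance."""
    totals: dict[str, int] = {}
    solved: dict[str, int] = {}
    for r in results:
        totals[r.system] = totals.get(r.system, 0) + 1
        solved[r.system] = solved.get(r.system, 0) + int(r.resolved)
    return {s: 100.0 * solved[s] / totals[s] for s in totals}


def total_cost(results: list[RunResult]) -> dict[str, float]:
    """Sum of API cost per system over all of its runs."""
    costs: dict[str, float] = {}
    for r in results:
        costs[r.system] = costs.get(r.system, 0.0) + r.api_cost_usd
    return costs


def spread_pp(scores: dict[str, float]) -> float:
    """Gap between the best and worst score, in percentage points."""
    return max(scores.values()) - min(scores.values())


# Toy data, NOT taken from the paper: two made-up systems on four instances each.
toy = [RunResult("claw-a", f"inst-{i}", i < 3, 0.5) for i in range(4)]
toy += [RunResult("claw-b", f"inst-{i}", i < 1, 0.25) for i in range(4)]

scores = pass_at_1(toy)
costs = total_cost(toy)
gap = spread_pp(scores)
```

Reporting `scores` and `costs` side by side reflects the point made above: two systems can sit close together on Pass@1 while differing widely in total API cost, so neither number is sufficient on its own.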