Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-Style Agent Harnesses on Coding Tasks

Abstract

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings: a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite for faster validation, an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only 19.1% Pass@1, whereas the full adapter reaches 73.4% with the same GLM 5.1 backbone, showing that adapter design is essential for OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw × nine-model sweep and a five-claw × two-model sweep, model choice changes Pass@1 by 29.4 pp and, with the model held fixed, harness choice changes it by 27.4 pp; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data are available at https://github.com/opensquilla/claw-swe-bench and https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench.
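To make the two reported axes concrete, the following is a minimal sketch of how Pass@1 and total API cost might be tallied side by side for two systems. It is an illustration only: the `RunRecord` type, its field names, the helper functions, and the toy data are all assumptions made here, not the benchmark's actual schema, tooling, or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One attempt at one instance (hypothetical schema, not the benchmark's)."""
    instance_id: str
    resolved: bool       # True if the evaluator accepted the extracted patch
    api_cost_usd: float  # API spend attributed to this attempt


def pass_at_1(records):
    """Percentage of instances resolved, one attempt per instance."""
    if not records:
        raise ValueError("no records to score")
    return 100.0 * sum(r.resolved for r in records) / len(records)


def total_cost(records):
    """Total API cost over all attempts, in USD."""
    return sum(r.api_cost_usd for r in records)


def summarize(name, records):
    """Report accuracy and cost together so neither axis is dropped."""
    return {
        "system": name,
        "pass_at_1": pass_at_1(records),
        "total_cost_usd": total_cost(records),
    }


# Toy data (invented for illustration): same accuracy, different cost.
system_a = [
    RunRecord("demo-1", True, 0.40),
    RunRecord("demo-2", True, 0.60),
    RunRecord("demo-3", False, 0.50),
    RunRecord("demo-4", True, 0.50),
]
system_b = [
    RunRecord("demo-1", True, 1.50),
    RunRecord("demo-2", False, 2.00),
    RunRecord("demo-3", True, 1.25),
    RunRecord("demo-4", True, 1.25),
]

if __name__ == "__main__":
    for name, records in (("system_a", system_a), ("system_b", system_b)):
        print(summarize(name, records))
```

In this toy comparison both systems resolve three of four instances, yet one spends three times as much, which is the kind of difference a Pass@1-only leaderboard would hide.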