Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-Style Agent Harnesses on Coding Tasks

Abstract

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings, including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. For faster validation, we also release Claw-SWE-Bench Lite, an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only 19.1% Pass@1, whereas the full adapter reaches 73.4% with the same GLM 5.1 backbone, showing that adapter design is essential for OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw × nine-model sweep and a five-claw × two-model sweep, model choice changes Pass@1 by 29.4 percentage points (pp), and harness choice changes it by 27.4 pp with the model held fixed. Systems with similar accuracy can still differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at https://github.com/opensquilla/claw-swe-bench and https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench.
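
The quantities reported above (Pass@1, percentage-point spreads, and total API cost) can be illustrated with a minimal sketch. It assumes Pass@1 is the percentage of instances resolved on a single attempt and that a spread is the maximum minus the minimum score across a sweep. The `RunResult` record, its field names, and the helper functions are hypothetical and are not taken from the Claw-SWE-Bench code; the instance data is a toy example, not benchmark output.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RunResult:
    """One harness/model attempt on one instance (hypothetical record layout)."""
    harness: str
    model: str
    instance_id: str
    resolved: bool        # True if the extracted patch passed the evaluator
    api_cost_usd: float   # API spend attributed to this attempt


def pass_at_1(results: List[RunResult]) -> float:
    """Percentage of instances resolved on the single attempt."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(r.resolved for r in results) / len(results)


def spread_pp(scores: Dict[str, float]) -> float:
    """Max minus min Pass@1 across a sweep, in percentage points."""
    return max(scores.values()) - min(scores.values())


def total_cost(results: List[RunResult]) -> float:
    """Total API cost of a run, summed over its instances."""
    return sum(r.api_cost_usd for r in results)


# Toy data: four instances, three resolved.
toy = [
    RunResult("claw-a", "model-x", "inst-1", True, 0.10),
    RunResult("claw-a", "model-x", "inst-2", True, 0.20),
    RunResult("claw-a", "model-x", "inst-3", False, 0.30),
    RunResult("claw-a", "model-x", "inst-4", True, 0.40),
]
toy_pass = pass_at_1(toy)
toy_cost = total_cost(toy)

# The two adapter scores quoted in the abstract, used only for the arithmetic.
adapter_gap = round(spread_pp({"direct-diff": 19.1, "full": 73.4}), 1)
```

With the two adapter scores quoted in the abstract, `adapter_gap` works out to 54.3 pp. Tracking `total_cost` next to `pass_at_1` reflects the abstract's point that systems with similar accuracy can differ in cost.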