Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-Style Agent Harnesses on Coding Tasks

Abstract

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings: a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. For faster validation, we also release Claw-SWE-Bench Lite, an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only 19.1% Pass@1, whereas the full adapter reaches 73.4% with the same GLM 5.1 backbone, showing that adapter design is essential for OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw × nine-model sweep and a five-claw × two-model sweep, model choice changes Pass@1 by 29.4 pp, and harness choice changes it by 27.4 pp with the model held fixed; systems with similar accuracy can also differ substantially in total API cost. Claw-SWE-Bench therefore treats the harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, and provides both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at https://github.com/opensquilla/claw-swe-bench and https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench.
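
As a rough illustration of how the headline numbers relate, the sketch below computes Pass@1 as the share of instances whose single submitted patch passes evaluation, and expresses a difference between two scores in percentage points (pp). The function names and the toy instance IDs are hypothetical and are not taken from the Claw-SWE-Bench code; only the 19.1% and 73.4% figures come from the abstract.

```python
from typing import Mapping


def pass_at_1(resolved: Mapping[str, bool]) -> float:
    """Return Pass@1 as a percentage: one attempt per instance, True if it passed."""
    if not resolved:
        raise ValueError("no instances to score")
    return 100.0 * sum(resolved.values()) / len(resolved)


def gap_pp(a_percent: float, b_percent: float) -> float:
    """Difference between two percentage scores, in percentage points."""
    return round(a_percent - b_percent, 1)


# Toy run with made-up instance IDs (assumption, not benchmark data).
toy_run = {"repo-a-1": True, "repo-a-2": False, "repo-b-1": True, "repo-c-1": True}
toy_score = pass_at_1(toy_run)

# Gap between the full adapter and the minimal direct-diff adapter,
# using the two Pass@1 values reported in the abstract.
adapter_gap = gap_pp(73.4, 19.1)

print(f"toy Pass@1: {toy_score:.1f}%")
print(f"full vs. minimal adapter: {adapter_gap} pp")
```

Reporting the gap in percentage points rather than as a relative change keeps it directly comparable with the 29.4 pp and 27.4 pp spreads quoted for model and harness choice.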