ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

Abstract

Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluating across 4 model families and 8 agent harness frameworks, we find that (i) harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline; (ii) completion remains the primary axis of variation, with no model saturating the benchmark; and (iii) automated generation enables evaluation at a scale that was previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.
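To make the three-module structure concrete, the following is a minimal Python sketch of a parse, generate, validate flow. It is an illustration under stated assumptions, not the ClawEnvKit implementation: every name here (`GenerationParams`, `Environment`, `parse_request`, `generate`, `validate`, `build_environments`), the keyword-based parser, and the placeholder scoring weights are hypothetical. A real generator would presumably call a language model. The sketch covers only three of the four validator checks; feasibility is omitted because it would require actually running an agent in the environment.

```python
from dataclasses import dataclass


@dataclass
class GenerationParams:
    """Structured parameters extracted from a request (hypothetical schema)."""
    category: str
    num_environments: int


@dataclass
class Environment:
    """Task spec, tool interface, and scoring config (hypothetical schema)."""
    task: str
    tools: list[str]
    scoring: dict[str, float]


def parse_request(text: str) -> GenerationParams:
    """Toy parser: picks out a count and a category keyword from the request."""
    words = text.lower().split()
    count = next((int(w) for w in words if w.isdigit()), 1)
    category = "email" if "email" in words else "general"
    return GenerationParams(category=category, num_environments=count)


def generate(params: GenerationParams) -> list[Environment]:
    """Toy generator: fills a fixed template instead of calling a model."""
    return [
        Environment(
            task=f"{params.category} task #{i}",
            tools=[f"{params.category}_search", f"{params.category}_send"],
            scoring={"completion": 0.5, "robustness": 0.5},  # placeholder weights
        )
        for i in range(params.num_environments)
    ]


def validate(envs: list[Environment]) -> list[Environment]:
    """Keep environments that pass structure, consistency, and diversity checks."""
    seen_tasks: set[str] = set()
    accepted: list[Environment] = []
    for env in envs:
        structurally_valid = bool(env.task) and bool(env.tools)
        # Internal consistency (assumed rule): scoring weights sum to 1.
        consistent = abs(sum(env.scoring.values()) - 1.0) < 1e-9
        # Diversity (assumed rule): no two accepted environments share a task.
        diverse = env.task not in seen_tasks
        if structurally_valid and consistent and diverse:
            seen_tasks.add(env.task)
            accepted.append(env)
    return accepted


def build_environments(request: str) -> list[Environment]:
    """Chain the three modules: parser, then generator, then validator."""
    return validate(generate(parse_request(request)))


if __name__ == "__main__":
    for env in build_environments("Create 3 email environments"):
        print(env.task, env.tools)
```

Running the validator after generation, rather than trusting the generator, means that malformed or duplicate environments are dropped before they reach a benchmark or training loop.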
