ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

Abstract

Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluating 4 model families and 8 agent harness frameworks, we find that (1) harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline; (2) completion remains the primary axis of variation, with no model saturating the benchmark; and (3) automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.
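As a rough illustration of the three-module structure described above (parser, generator, validator), the following minimal Python sketch chains three toy stages. It is not ClawEnvKit's implementation: every name here (GenerationParams, Environment, parse_request, generate, validate, build_environments) is hypothetical, the "parser" reads simple key=value tokens instead of free-form natural language, and the validator only approximates structural validity, internal consistency, and diversity, leaving feasibility checking out entirely.

```python
from dataclasses import dataclass


@dataclass
class GenerationParams:
    """Structured parameters extracted from a request (hypothetical fields)."""
    category: str
    num_environments: int


@dataclass
class Environment:
    """One generated environment: task spec, tool interface, scoring config."""
    task_spec: str
    tools: list[str]
    scoring: dict[str, float]


def parse_request(text: str) -> GenerationParams:
    """Stage 1 (parser): toy key=value parsing stands in for NL understanding."""
    fields = dict(part.split("=", 1) for part in text.split())
    return GenerationParams(fields["category"], int(fields["count"]))


def generate(params: GenerationParams) -> list[Environment]:
    """Stage 2 (generator): emit placeholder environments for the category."""
    return [
        Environment(
            task_spec=f"{params.category} task #{i}",
            tools=[f"{params.category}_tool"],
            scoring={"completion": 1.0},
        )
        for i in range(params.num_environments)
    ]


def validate(envs: list[Environment]) -> list[Environment]:
    """Stage 3 (validator): keep valid, consistent, non-duplicate environments."""
    seen: set[str] = set()
    kept: list[Environment] = []
    for env in envs:
        # Structural validity: a task description and at least one tool.
        structurally_valid = bool(env.task_spec) and bool(env.tools)
        # Internal consistency: scoring weights must sum to 1.
        weights_consistent = abs(sum(env.scoring.values()) - 1.0) < 1e-9
        # Crude diversity check: reject exact duplicate task specs.
        is_new = env.task_spec not in seen
        if structurally_valid and weights_consistent and is_new:
            seen.add(env.task_spec)
            kept.append(env)
    return kept


def build_environments(text: str) -> list[Environment]:
    """Run the three stages in order: parse, generate, validate."""
    return validate(generate(parse_request(text)))
```

Under these assumptions, build_environments("category=calendar count=3") returns three validated placeholder environments. A real system would presumably replace each stage with a far richer component while keeping the same parse, generate, validate ordering.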
