ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
April 20, 2026
Authors: Xirui Li, Ming Li, Derry Xu, Wei-Lin Chiang, Ion Stoica, Cho-Jui Hsieh, Tianyi Zhou
cs.AI
Abstract
Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates formal, verified environments from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments.

Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluating 4 model families and 8 agent harness frameworks, we find that (1) harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline; (2) task completion remains the primary axis of variation, with no model saturating the benchmark; and (3) automated generation enables evaluation at a scale previously infeasible.

Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.
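The parse → generate → validate architecture described above can be sketched as a minimal Python pipeline. Everything here is illustrative: the class names, the keyword-based parser, and the validation checks are assumptions standing in for the paper's actual (presumably LLM-backed) components, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class GenerationParams:
    """Structured parameters extracted by the parser (stage 1)."""
    category: str
    tools: list

@dataclass
class Environment:
    """A generated environment: task spec, tool interface, scoring config."""
    task_spec: str
    tool_interface: list
    scoring_config: dict

def parse(description: str) -> GenerationParams:
    """Stage 1: extract structured generation parameters from free text.
    (Toy keyword heuristic standing in for the real parser.)"""
    words = description.lower().split()
    category = "navigation" if "navigate" in words else "manipulation"
    tools = [w for w in words if w.startswith("tool:")]
    return GenerationParams(category=category, tools=tools or ["tool:grip"])

def generate(params: GenerationParams) -> Environment:
    """Stage 2: produce the task specification, tool interface,
    and scoring configuration from the parsed parameters."""
    return Environment(
        task_spec=f"Complete a {params.category} task",
        tool_interface=params.tools,
        scoring_config={"completion_weight": 1.0},
    )

def validate(env: Environment) -> bool:
    """Stage 3: enforce structural validity and internal consistency.
    (Feasibility and diversity checks are omitted in this sketch.)"""
    return (bool(env.task_spec)
            and bool(env.tool_interface)
            and abs(sum(env.scoring_config.values()) - 1.0) < 1e-9)

def build_environment(description: str) -> Environment:
    """End-to-end pipeline: parse -> generate -> validate."""
    env = generate(parse(description))
    if not validate(env):
        raise ValueError("generated environment failed validation")
    return env
```

In this shape, the live-evaluation mode described in the abstract reduces to calling `build_environment` on each user-supplied capability description, with the validator gating what reaches the agent.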