ClawGym: A Scalable Framework for Building Effective Claw Agents

Abstract

Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per-task sandboxes. To support reliable evaluation, we also construct ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human-LLM review. Relevant resources will soon be released at https://github.com/ClawGym.
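
To make the idea of parallel rollouts across per-task sandboxes concrete, the following is a minimal sketch and not the ClawGym pipeline itself. The abstract does not specify the sandbox technology, the task schema, or the verification mechanisms, so everything here is an assumption made for illustration: each "sandbox" is a temporary directory, `toy_policy` is a hypothetical stand-in for an agent, and `verify` is a simple rule-based check on the final workspace state (a model-based judge, if one is part of the hybrid verification, is not shown).

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def toy_policy(workspace: Path, instruction: str) -> None:
    """Hypothetical stand-in for an agent: upper-cases notes.txt into summary.txt."""
    notes = (workspace / "notes.txt").read_text()
    (workspace / "summary.txt").write_text(notes.upper())


def verify(workspace: Path, expected: dict) -> bool:
    """Rule-based check (assumed): every expected file exists with the expected content."""
    for name, content in expected.items():
        path = workspace / name
        if not path.exists() or path.read_text() != content:
            return False
    return True


def run_task(task: dict) -> dict:
    """Run one rollout in its own temporary workspace and score the final state."""
    # A fresh directory per task keeps rollouts from interfering with each other.
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        for name, content in task["files"].items():
            (workspace / name).write_text(content)
        toy_policy(workspace, task["instruction"])
        reward = 1.0 if verify(workspace, task["expected"]) else 0.0
        return {"task_id": task["id"], "reward": reward}


def run_parallel(tasks: list, max_workers: int = 4) -> list:
    """Run all tasks concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, tasks))


if __name__ == "__main__":
    tasks = [
        {"id": "t1", "instruction": "Summarize notes.txt",
         "files": {"notes.txt": "buy milk"},
         "expected": {"summary.txt": "BUY MILK"}},
        {"id": "t2", "instruction": "Summarize notes.txt",
         "files": {"notes.txt": "call bob"},
         "expected": {"summary.txt": "Call Bob"}},  # the toy policy fails this check
    ]
    for result in run_parallel(tasks):
        print(result)
```

In a reinforcement learning setting, the per-task rewards returned by a loop like this would feed the policy update; the isolation of each workspace is what allows many rollouts to run at once without sharing state.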