EnterpriseClawBench: An Agent Benchmark Based on Real Work Sessions

Abstract

Enterprise agents increasingly operate inside workspaces: they read heterogeneous files, invoke tools, and deliver business artifacts. We introduce EnterpriseClawBench, an enterprise agent benchmark constructed from proprietary, real-world agent sessions. Starting from a large archive of workplace sessions, EnterpriseClawBench produces 852 reproducible tasks, each paired with recovered fixtures, a rewritten prompt, a role class, a skill subclass, hard rules, and semantic rubrics. Because the sessions contain internal enterprise content, we do not release the benchmark data; our reusable contribution is instead the construction and evaluation protocol. On EnterpriseClawBench, the best configuration, Codex with GPT-5.5, reaches only 0.663. These results show that enterprise agent evaluation must report harness-model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior, rather than collapsing performance into a single score. Code: https://github.com/FrontisAI/EnterpriseClawBench