EnterpriseClawBench: An Agent Benchmark Based on Real Workplace Sessions

Abstract

Enterprise agents increasingly operate inside workspaces: they read heterogeneous files, invoke tools, and deliver business artifacts. We introduce EnterpriseClawBench, an enterprise agent benchmark constructed from proprietary, real-world agent sessions. Starting from a large archive of workplace sessions, EnterpriseClawBench yields 852 reproducible tasks, each paired with recovered fixtures, a rewritten prompt, a role class, a skill subclass, hard rules, and semantic rubrics. Because the sessions contain internal enterprise content, we do not release the benchmark data; instead, our reusable contribution is the construction and evaluation protocol. On EnterpriseClawBench, the best configuration (Codex with GPT-5.5) reaches a score of only 0.663. These results show that enterprise agent evaluation must report harness-model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior, rather than collapsing performance into a single score. Code: https://github.com/FrontisAI/EnterpriseClawBench
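
To make the task structure and the multi-dimensional reporting concrete, the following is a minimal sketch and not the benchmark's actual schema or code. Every class, field, and function name here (`Task`, `RunReport`, `summarize`, `cost_usd`, and so on) is a hypothetical chosen for illustration, and the numbers in the usage example are placeholders, not reported results. Skill-transfer behavior is left out because the abstract does not say how it is measured.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative only."""
    task_id: str
    prompt: str                 # rewritten prompt
    fixtures: list[str]         # recovered workspace files
    role_class: str
    skill_subclass: str
    hard_rules: list[str] = field(default_factory=list)
    semantic_rubric: list[str] = field(default_factory=list)


@dataclass
class RunReport:
    """One harness-model run on one task, kept as separate dimensions."""
    harness: str
    model: str
    task_score: float
    artifact_delivered: bool
    visual_quality: float
    cost_usd: float
    runtime_s: float


def summarize(reports):
    """Average each dimension per (harness, model) pair.

    Nothing is blended into a single number; each dimension is
    reported on its own.
    """
    groups = {}
    for r in reports:
        groups.setdefault((r.harness, r.model), []).append(r)

    summary = {}
    for key, runs in groups.items():
        n = len(runs)
        summary[key] = {
            "task_score": sum(r.task_score for r in runs) / n,
            "artifact_delivery_rate": sum(r.artifact_delivered for r in runs) / n,
            "visual_quality": sum(r.visual_quality for r in runs) / n,
            "cost_usd": sum(r.cost_usd for r in runs) / n,
            "runtime_s": sum(r.runtime_s for r in runs) / n,
        }
    return summary


# Placeholder data, not benchmark results.
example_reports = [
    RunReport("harness_a", "model_x", 0.5, True, 0.5, 1.0, 100.0),
    RunReport("harness_a", "model_x", 1.0, False, 1.0, 3.0, 300.0),
]
example_summary = summarize(example_reports)
```

Keying the summary on the (harness, model) pair reflects the abstract's point that the same model can behave differently under different harnesses, so neither should be reported alone.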