EnterpriseClawBench: An Agent Benchmark Built from Real Workplace Sessions

Abstract

Enterprise agents increasingly operate inside workspaces: they read heterogeneous files, invoke tools, and deliver business artifacts. We introduce EnterpriseClawBench, an enterprise agent benchmark constructed from proprietary, real-world agent sessions. Starting from a large archive of workplace sessions, EnterpriseClawBench produces 852 reproducible tasks, each paired with recovered fixtures, rewritten prompts, role classes, skill subclasses, hard rules, and semantic rubrics. Because the sessions contain internal enterprise content, we do not release the benchmark data; instead, our reusable contribution is the construction and evaluation protocol. On EnterpriseClawBench, the best configuration reaches only 0.663 (Codex with GPT-5.5). These results show that enterprise agent evaluation must report harness-model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior, rather than collapsing performance into a single score. Code: https://github.com/FrontisAI/EnterpriseClawBench