EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings
March 13, 2026
Authors: Shiva Krishna Reddy Malay, Shravan Nayak, Jishnu Sethumadhavan Nair, Sagar Davasam, Aman Tiwari, Sathwik Tejaswi Madhusudhan, Sridhar Krishna Nemala, Srinivas Sunkara, Sai Rajeswar
cs.AI
Abstract
Large language models are shifting from passive information providers to active agents intended for complex workflows. However, their deployment as reliable AI workers in enterprise settings is stalled by benchmarks that fail to capture the intricacies of professional environments, specifically the need for long-horizon planning amid persistent state changes and strict access protocols. In this work, we introduce EnterpriseOps-Gym, a benchmark designed to evaluate agentic planning in realistic enterprise settings. EnterpriseOps-Gym features a containerized sandbox with 164 database tables and 512 functional tools to mimic real-world search friction. Within this environment, agents are evaluated on 1,150 expert-curated tasks across eight mission-critical verticals (including Customer Service, HR, and IT). Our evaluation of 14 frontier models reveals critical limitations: the top-performing model, Claude Opus 4.5, achieves only 37.4% success. Further analysis shows that providing oracle human plans improves performance by 14 to 35 percentage points, pinpointing strategic reasoning as the primary bottleneck. Additionally, agents frequently fail to refuse infeasible tasks (the best model achieves only 53.9%), leading to unintended and potentially harmful side effects. Our findings underscore that current agents are not yet ready for autonomous enterprise deployment. More broadly, EnterpriseOps-Gym provides a concrete testbed to advance the robustness of agentic planning in professional workflows.
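The abstract reports three kinds of numbers: a task success rate, a refusal rate on infeasible tasks, and a gain in percentage points when oracle plans are supplied. The sketch below illustrates how such metrics could be computed from per-task outcomes. It is not the benchmark's actual scoring code: the `TaskResult` record, its fields, the choice to compute success over feasible tasks only, and the toy data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task outcome record (not from the paper)."""
    task_id: str
    feasible: bool   # whether the task can be completed at all
    succeeded: bool  # whether the final environment state passed verification
    refused: bool    # whether the agent declined to act


def success_rate(results):
    """Percentage of feasible tasks completed successfully (assumed definition)."""
    feasible = [r for r in results if r.feasible]
    if not feasible:
        return 0.0
    return 100.0 * sum(r.succeeded for r in feasible) / len(feasible)


def refusal_rate(results):
    """Percentage of infeasible tasks that the agent correctly refused."""
    infeasible = [r for r in results if not r.feasible]
    if not infeasible:
        return 0.0
    return 100.0 * sum(r.refused for r in infeasible) / len(infeasible)


def percentage_point_gain(baseline_pct, with_plan_pct):
    """Absolute difference between two rates, in percentage points."""
    return with_plan_pct - baseline_pct


# Toy data for illustration only; these are not results from the paper.
toy_results = [
    TaskResult("t1", feasible=True, succeeded=True, refused=False),
    TaskResult("t2", feasible=True, succeeded=False, refused=False),
    TaskResult("t3", feasible=True, succeeded=False, refused=False),
    TaskResult("t4", feasible=True, succeeded=False, refused=True),
    TaskResult("t5", feasible=False, succeeded=False, refused=True),
    TaskResult("t6", feasible=False, succeeded=False, refused=False),
]

print(success_rate(toy_results))
print(refusal_rate(toy_results))
print(percentage_point_gain(25.0, 50.0))
```

Note that a gain in percentage points is an absolute difference between two rates, not a relative improvement: moving from 25% to 50% is a gain of 25 points, even though the rate doubled.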