EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings
March 13, 2026
Authors: Shiva Krishna Reddy Malay, Shravan Nayak, Jishnu Sethumadhavan Nair, Sagar Davasam, Aman Tiwari, Sathwik Tejaswi Madhusudhan, Sridhar Krishna Nemala, Srinivas Sunkara, Sai Rajeswar
cs.AI
Abstract
Large language models are shifting from passive information providers to active agents intended for complex workflows. However, their deployment as reliable AI workers in the enterprise is stalled by benchmarks that fail to capture the intricacies of professional environments, specifically the need for long-horizon planning amid persistent state changes and strict access protocols. In this work, we introduce EnterpriseOps-Gym, a benchmark designed to evaluate agentic planning in realistic enterprise settings. EnterpriseOps-Gym features a containerized sandbox with 164 database tables and 512 functional tools to mimic real-world search friction. Within this environment, agents are evaluated on 1,150 expert-curated tasks across eight mission-critical verticals (including Customer Service, HR, and IT). Our evaluation of 14 frontier models reveals critical limitations: the top-performing model, Claude Opus 4.5, achieves only 37.4% success. Further analysis shows that providing oracle human plans improves performance by 14-35 percentage points, pinpointing strategic reasoning as the primary bottleneck. Additionally, agents frequently fail to refuse infeasible tasks (the best model reaches only a 53.9% success rate on them), leading to unintended and potentially harmful side effects. Our findings underscore that current agents are not yet ready for autonomous enterprise deployment. More broadly, EnterpriseOps-Gym provides a concrete testbed for advancing the robustness of agentic planning in professional workflows.