EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings

Abstract

Large language models are shifting from passive information providers to active agents intended for complex workflows. However, their deployment as reliable AI workers in enterprise is stalled by benchmarks that fail to capture the intricacies of professional environments, specifically, the need for long-horizon planning amidst persistent state changes and strict access protocols. In this work, we introduce EnterpriseOps-Gym, a benchmark designed to evaluate agentic planning in realistic enterprise settings. Specifically, EnterpriseOps-Gym features a containerized sandbox with 164 database tables and 512 functional tools to mimic real-world search friction. Within this environment, agents are evaluated on 1,150 expert-curated tasks across eight mission-critical verticals (including Customer Service, HR, and IT). Our evaluation of 14 frontier models reveals critical limitations in state-of-the-art models: the top-performing Claude Opus 4.5 achieves only 37.4% success. Further analysis shows that providing oracle human plans improves performance by 14-35 percentage points, pinpointing strategic reasoning as the primary bottleneck. Additionally, agents frequently fail to refuse infeasible tasks (best model achieves 53.9%), leading to unintended and potentially harmful side effects. Our findings underscore that current agents are not yet ready for autonomous enterprise deployment. More broadly, EnterpriseOps-Gym provides a concrete testbed to advance the robustness of agentic planning in professional workflows.
