EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings

Abstract

Large language models are shifting from passive information providers to active agents intended for complex workflows. However, their deployment as reliable AI workers in enterprise settings is stalled by benchmarks that fail to capture the intricacies of professional environments, specifically the need for long-horizon planning amid persistent state changes and strict access protocols. In this work, we introduce EnterpriseOps-Gym, a benchmark designed to evaluate agentic planning in realistic enterprise settings. EnterpriseOps-Gym features a containerized sandbox with 164 database tables and 512 functional tools to mimic real-world search friction. Within this environment, agents are evaluated on 1,150 expert-curated tasks across eight mission-critical verticals (including Customer Service, HR, and IT). Our evaluation of 14 frontier models reveals critical limitations: the top-performing model, Claude Opus 4.5, achieves only 37.4% success. Further analysis shows that providing oracle human plans improves performance by 14-35 percentage points, pinpointing strategic reasoning as the primary bottleneck. Additionally, agents frequently fail to refuse infeasible tasks (best model achieves 53.9%), leading to unintended and potentially harmful side effects. Our findings underscore that current agents are not yet ready for autonomous enterprise deployment. More broadly, EnterpriseOps-Gym provides a concrete testbed for advancing the robustness of agentic planning in professional workflows.

