EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings

Abstract

Large language models are shifting from passive information providers to active agents intended for complex workflows. However, their deployment as reliable AI workers in enterprise settings is stalled by benchmarks that fail to capture the intricacies of professional environments, in particular the need for long-horizon planning amid persistent state changes and strict access protocols. In this work, we introduce EnterpriseOps-Gym, a benchmark designed to evaluate agentic planning in realistic enterprise settings. EnterpriseOps-Gym features a containerized sandbox with 164 database tables and 512 functional tools to mimic real-world search friction. Within this environment, agents are evaluated on 1,150 expert-curated tasks across eight mission-critical verticals (including Customer Service, HR, and IT). Our evaluation of 14 frontier models reveals critical limitations in state-of-the-art models: the top-performing model, Claude Opus 4.5, achieves only 37.4% success. Further analysis shows that providing oracle human plans improves performance by 14-35 percentage points, pinpointing strategic reasoning as the primary bottleneck. Additionally, agents frequently fail to refuse infeasible tasks (the best model achieves 53.9%), leading to unintended and potentially harmful side effects. Our findings underscore that current agents are not yet ready for autonomous enterprise deployment. More broadly, EnterpriseOps-Gym provides a concrete testbed for advancing the robustness of agentic planning in professional workflows.

