SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

Abstract

As autonomous coding agents become capable of handling increasingly long-horizon tasks, they have begun to demonstrate the potential to complete end-to-end software development. Although existing benchmarks have recently evolved from localized code editing to from-scratch project generation, they remain confined to structurally simplified, single-stack applications. Consequently, they fail to capture the heterogeneous environments, full-stack orchestration, and system-level complexity of real enterprise Software-as-a-Service (SaaS) systems, leaving a critical gap in assessing agents under realistic engineering constraints. To fill this gap, we introduce SaaSBench, the first benchmark designed to explore the boundaries of AI agents in enterprise SaaS engineering. It spans 30 complex tasks across 6 SaaS domains with 5,370 validation nodes, and incorporates 8 programming languages, 6 databases, and 13 frameworks to mirror real-world software heterogeneity. Furthermore, we design a dependency-aware hybrid evaluation paradigm tailored to long-horizon systems with multi-component coupling, enabling fine-grained, reproducible assessment. Crucially, our extensive experiments reveal a striking insight: the primary bottleneck for state-of-the-art agents is not generating isolated code logic, but successfully configuring and integrating a multi-component system. Over 95% of task failures occur before agents even reach deep business logic: models often fall victim to overconfidence and halt prematurely during foundational system setup, or get trapped in ineffective debugging loops. We hope SaaSBench serves as a practical and challenging testbed to drive the evolution of reliable, system-level coding agents. The code is available at https://github.com/ShadeCloak/SaaSbench.
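The abstract does not specify how the dependency-aware evaluation is implemented. As a rough illustration only, the sketch below shows one way validation nodes with dependencies could be scored so that a failed setup step marks downstream checks as blocked rather than failed. The `Node` class, the `evaluate` function, the three status labels, and the node names are all hypothetical and are not taken from SaaSBench; the sketch also assumes the dependency graph is acyclic.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical validation node; not the SaaSBench data model."""
    name: str
    passed: bool                      # outcome of this node's own check
    deps: list[str] = field(default_factory=list)


def evaluate(nodes: list[Node]) -> dict[str, str]:
    """Label each node 'pass', 'fail', or 'blocked'.

    A node is 'blocked' when any dependency did not pass, so its own
    check result is ignored. Assumes the dependency graph has no cycles.
    """
    by_name = {n.name: n for n in nodes}
    status: dict[str, str] = {}

    def visit(name: str) -> str:
        if name in status:
            return status[name]
        node = by_name[name]
        if any(visit(d) != "pass" for d in node.deps):
            status[name] = "blocked"
        else:
            status[name] = "pass" if node.passed else "fail"
        return status[name]

    for n in nodes:
        visit(n.name)
    return status


# Illustrative chain: the API service fails to boot, so the
# business-logic checks behind it are never reached.
nodes = [
    Node("db_up", True),
    Node("api_boot", False, ["db_up"]),
    Node("login", True, ["api_boot"]),
    Node("billing", True, ["login"]),
]
result = evaluate(nodes)
```

In this toy chain a single integration failure at `api_boot` blocks every business-logic check behind it, which is the kind of pattern the abstract describes when it reports that most failures occur before agents reach deep business logic.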