SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

Abstract

As autonomous coding agents become capable of handling increasingly long-horizon tasks, they have begun to demonstrate the potential to complete end-to-end software development. Although existing benchmarks have recently evolved from localized code editing to from-scratch project generation, they remain confined to structurally simplified, single-stack applications. Consequently, they fail to capture the heterogeneous environments, full-stack orchestration, and system-level complexity of real enterprise Software as a Service (SaaS) systems, leaving a critical gap in assessing agents under realistic engineering constraints. To fill this gap, we introduce SaaSBench, the first benchmark designed to explore the boundaries of AI agents in enterprise SaaS engineering. It comprises 30 complex tasks across 6 SaaS domains with 5,370 validation nodes, and it incorporates 8 programming languages, 6 databases, and 13 frameworks to mirror real-world software heterogeneity. Furthermore, we design a dependency-aware hybrid evaluation paradigm tailored to complex systems with long horizons and multi-component coupling, enabling fine-grained, reproducible assessment. Our extensive experiments reveal a striking insight: the primary bottleneck for state-of-the-art agents is not generating isolated code logic, but successfully configuring and integrating a multi-component system. Over 95% of task failures occur before agents reach deep business logic. Models often fall victim to overconfidence and halt prematurely during foundational system setup, or become trapped in ineffective debugging loops. We hope SaaSBench serves as a practical and challenging testbed to drive the evolution of reliable, system-level coding agents. The code is available at https://github.com/ShadeCloak/SaaSbench.