SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

Abstract

As autonomous coding agents become capable of handling increasingly long-horizon tasks, they have begun to show the potential to complete end-to-end software development. Although existing benchmarks have recently evolved from localized code editing to from-scratch project generation, they remain confined to structurally simplified, single-stack applications. Consequently, they fail to capture the heterogeneous environments, full-stack orchestration, and system-level complexity of real enterprise Software as a Service (SaaS) systems, leaving a critical gap in assessing agents under realistic engineering constraints. To fill this gap, we introduce SaaSBench, the first benchmark designed to explore the boundaries of AI agents in enterprise SaaS engineering. It spans 30 complex tasks across 6 SaaS domains with 5,370 validation nodes, and it incorporates 8 programming languages, 6 databases, and 13 frameworks to mirror real-world software heterogeneity. Furthermore, we design a dependency-aware hybrid evaluation paradigm tailored to long-horizon systems with multi-component coupling, enabling fine-grained, reproducible assessment. Crucially, our extensive experiments reveal a striking insight: the primary bottleneck for state-of-the-art agents is not generating isolated code logic, but successfully configuring and integrating a multi-component system. Over 95% of task failures occur before agents even reach deep business logic: models often stop prematurely during foundational system setup out of overconfidence, or become trapped in ineffective debugging loops. We hope SaaSBench serves as a practical and challenging testbed to drive the evolution of reliable, system-level coding agents. The code is available at https://github.com/ShadeCloak/SaaSbench.