SaaSBench: Exploring the Limits of Coding Agents in Long-Horizon Enterprise SaaS Engineering

Abstract

As autonomous coding agents become capable of handling increasingly long-horizon tasks, they are beginning to show the potential to complete end-to-end software development. Although existing benchmarks have recently evolved from localized code editing to from-scratch project generation, they remain confined to structurally simplified, single-stack applications. Consequently, they fail to capture the heterogeneous environments, full-stack orchestration, and system-level complexity of real enterprise Software as a Service (SaaS) systems, leaving a critical gap in assessing agents under realistic engineering constraints. To fill this gap, we introduce SaaSBench, the first benchmark designed to explore the boundaries of AI agents in enterprise SaaS engineering. It comprises 30 complex tasks across 6 SaaS domains with 5,370 validation nodes, and it incorporates 8 programming languages, 6 databases, and 13 frameworks to mirror real-world software heterogeneity. Furthermore, we design a dependency-aware hybrid evaluation paradigm tailored to complex systems with long horizons and multi-component coupling, enabling fine-grained, reproducible assessment. Our extensive experiments reveal a striking insight: the primary bottleneck for state-of-the-art agents is not generating isolated code logic, but successfully configuring and integrating a multi-component system. Over 95% of task failures occur before agents even reach deep business logic. Models are often overconfident and halt prematurely during foundational system setup, or they become trapped in ineffective debugging loops. We hope SaaSBench serves as a practical and challenging testbed to drive the evolution of reliable, system-level coding agents. The code is available at https://github.com/ShadeCloak/SaaSbench.
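To make the idea of dependency-aware evaluation concrete, the following is a minimal sketch, not the SaaSBench implementation. It assumes that validation nodes form an acyclic dependency graph and that a node is checked only when all of its prerequisites have passed. The names `ValidationNode`, `evaluate`, and the three example nodes are hypothetical and do not come from the benchmark. The sketch covers only the dependency-aware aspect and says nothing about how the hybrid part of the paradigm is realized.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ValidationNode:
    """Hypothetical validation node: a named check plus its prerequisites."""
    name: str
    check: Callable[[], bool]
    depends_on: List[str] = field(default_factory=list)


def evaluate(nodes: List[ValidationNode]) -> Dict[str, str]:
    """Return 'passed', 'failed', or 'blocked' for each node.

    Assumes the dependency graph is acyclic; a cycle would recurse forever.
    """
    by_name = {n.name: n for n in nodes}
    status: Dict[str, str] = {}

    def visit(name: str) -> str:
        if name in status:
            return status[name]
        node = by_name[name]
        # Resolve every prerequisite before deciding this node's status.
        dep_results = [visit(dep) for dep in node.depends_on]
        if any(result != "passed" for result in dep_results):
            # A prerequisite failed or was blocked, so the check is not run.
            status[name] = "blocked"
        else:
            status[name] = "passed" if node.check() else "failed"
        return status[name]

    for n in nodes:
        visit(n.name)
    return status


# Illustrative chain: setup -> integration -> business logic.
nodes = [
    ValidationNode("db_up", lambda: True),
    ValidationNode("api_boots", lambda: False, ["db_up"]),
    ValidationNode("invoice_logic", lambda: True, ["api_boots"]),
]
result = evaluate(nodes)
```

In this toy chain the business-logic check is never reached because the service fails to boot, which loosely mirrors the reported pattern of failures occurring before deep business logic. Separating "blocked" from "failed" is one way to keep downstream nodes from being counted as independent logic errors.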