SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

Abstract

The emergence of "vibe coding" platforms, where users describe applications in natural language and AI agents autonomously generate full-stack software, has created a need for rigorous evaluation beyond code-level benchmarks. To assess these platforms as virtual software development agencies on understanding business requirements, making architectural decisions, writing production code, handling iterative modifications, and maintaining business readiness, we introduce SWE-WebDev Bench, a 68-metric evaluation framework comprising 25 primary and 43 diagnostic metrics in seven groups, organized along three dimensions: Interaction Mode (App Creation Request (ACR) vs. App Modification Request (AMR)), Agency Angle (Product Manager (PM), Engineering, Ops), and Complexity Tier (T4 multi-role SaaS, T5 AI-native). Our evaluation (six platforms, three domains, 18 evaluation cells) reveals four recurring shortcomings in the current generation of AI app builders: (1) a specification bottleneck, where platforms compress rich business requirements into oversimplified technical plans; (2) a pervasive frontend-backend decoupling, where visually polished UIs mask absent or broken backend infrastructure; (3) a steep production-readiness cliff, where no platform scores above 60% on engineering quality and post-generation human effort varies substantially across platforms; and (4) widespread security and infrastructure failures, with no platform exceeding a Security Score of 65% against a 90% target and concurrency handling as low as 6%. These observations are descriptive of our sample and require larger-scale replication to establish generality. We release SWE-WebDev Bench as a community benchmark to enable such replication and to help platform builders identify and address these gaps. Code and benchmark resources are available at https://github.com/snowmountainAi/webdevbench and https://webdevbench.com/.
