

SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

May 6, 2026
Authors: Siddhant Saxena, Nilesh Trivedi, Vinayaka Jyothi
cs.AI

Abstract

The emergence of "vibe coding" platforms, where users describe applications in natural language and AI agents autonomously generate full-stack software, has created a need for rigorous evaluation beyond code-level benchmarks. To assess these platforms as virtual software development agencies, covering their ability to understand business requirements, make architectural decisions, write production code, handle iterative modifications, and maintain business readiness, we introduce SWE-WebDev Bench, a 68-metric evaluation framework spanning 25 primary and 43 diagnostic metrics across seven groups, organized along three dimensions: Interaction Mode (App Creation Request (ACR) vs. App Modification Request (AMR)), Agency Angle (Product Manager (PM), Engineering, Ops), and Complexity Tier (T4 multi-role SaaS, T5 AI-native). Our evaluation (six platforms, three domains, 18 evaluation cells) reveals four recurring shortcomings in the current generation of AI app builders: (1) a specification bottleneck, where platforms compress rich business requirements into oversimplified technical plans; (2) pervasive frontend-backend decoupling, where visually polished UIs mask absent or broken backend infrastructure; (3) a steep production-readiness cliff, where no platform scores above 60% on engineering quality and post-generation human effort varies substantially across platforms; and (4) widespread security and infrastructure failures, with no platform exceeding a 65% Security Score against a 90% target and concurrency handling as low as 6%. These observations are descriptive of our sample and require larger-scale replication to establish generality. We release SWE-WebDev Bench as a community benchmark to enable such replication and help platform builders identify and address these gaps. Code and benchmark resources are available at: https://github.com/snowmountainAi/webdevbench and https://webdevbench.com/.
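The evaluation grid described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, assuming (as the cell count suggests) that the 18 evaluation cells are the cross product of the six platforms and three domains; the platform and domain names below are placeholders, not the ones used in the paper.

```python
from itertools import product

# Hypothetical placeholders -- the abstract does not name
# the six platforms or the three application domains.
platforms = [f"platform_{i}" for i in range(1, 7)]   # six AI app-builder platforms
domains = ["domain_A", "domain_B", "domain_C"]       # three application domains

# 6 platforms x 3 domains = 18 evaluation cells.
cells = list(product(platforms, domains))
assert len(cells) == 18

# Each cell is scored along the framework's three organizing dimensions.
dimensions = {
    "interaction_mode": ["ACR", "AMR"],            # app creation vs. modification requests
    "agency_angle": ["PM", "Engineering", "Ops"],  # virtual-agency perspectives
    "complexity_tier": ["T4", "T5"],               # multi-role SaaS vs. AI-native
}
```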
May 8, 2026