ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development
January 16, 2026
Authors: Jie Yang, Honglin Guo, Li Ji, Jiazheng Zhou, Rui Zheng, Zhikai Lei, Shuo Zhang, Zhiheng Xi, Shichun Liu, Yuxin Wang, Bo Wang, Yining Zheng, Tao Gui, Xipeng Qiu
cs.AI
Abstract
The evolution of Large Language Models (LLMs) into autonomous agents has expanded the scope of AI coding from localized code generation to complex, repository-level, execution-driven problem solving. However, current benchmarks predominantly evaluate code logic in static contexts, neglecting the dynamic, full-process requirements of real-world engineering, particularly in backend development, which demands rigorous environment configuration and service deployment. To address this gap, we introduce ABC-Bench, a benchmark explicitly designed to evaluate agentic backend coding within a realistic, executable workflow. Using a scalable automated pipeline, we curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Distinct from previous evaluations, ABC-Bench requires agents to manage the entire development lifecycle, from repository exploration to instantiating containerized services, and to pass external end-to-end API tests. Our extensive evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks, highlighting a substantial disparity between current model capabilities and the demands of practical backend engineering. Our code is available at https://github.com/OpenMOSS/ABC-Bench.
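For illustration, the sketch below shows what an external end-to-end API check of this kind might look like: the agent is expected to have a containerized backend service already running, and an outside script exercises its public API. The base URL, port, endpoint paths, and payload here are assumptions made for the example, not details of ABC-Bench's actual test harness.

```python
# Minimal sketch of an external end-to-end API test, assuming the agent's
# containerized service is reachable at BASE_URL before this script runs.
# Endpoint paths and payloads are illustrative, not taken from the benchmark.
import requests

BASE_URL = "http://localhost:8080"  # port assumed to be exposed by the agent's container


def test_create_and_fetch_item():
    # Create a resource through the service's public API.
    created = requests.post(
        f"{BASE_URL}/items",
        json={"name": "example", "price": 9.99},
        timeout=10,
    )
    assert created.status_code == 201

    # Fetch it back and verify the service persisted the data correctly.
    item_id = created.json()["id"]
    fetched = requests.get(f"{BASE_URL}/items/{item_id}", timeout=10)
    assert fetched.status_code == 200
    assert fetched.json()["name"] == "example"


if __name__ == "__main__":
    test_create_and_fetch_item()
    print("end-to-end API check passed")
```

Because such tests run outside the repository and only observe the deployed service, they reward agents that handle environment setup, dependency installation, and service startup correctly, rather than code that merely compiles.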