ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development
January 16, 2026
Authors: Jie Yang, Honglin Guo, Li Ji, Jiazheng Zhou, Rui Zheng, Zhikai Lei, Shuo Zhang, Zhiheng Xi, Shichun Liu, Yuxin Wang, Bo Wang, Yining Zheng, Tao Gui, Xipeng Qiu
cs.AI
Abstract
The evolution of Large Language Models (LLMs) into autonomous agents has expanded the scope of AI coding from localized code generation to complex, repository-level, execution-driven problem solving. However, current benchmarks predominantly evaluate code logic in static contexts, neglecting the dynamic, full-process requirements of real-world engineering, particularly in backend development, which demands rigorous environment configuration and service deployment. To address this gap, we introduce ABC-Bench, a benchmark explicitly designed to evaluate agentic backend coding within a realistic, executable workflow. Using a scalable automated pipeline, we curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Distinct from previous evaluations, ABC-Bench requires agents to manage the entire development lifecycle, from repository exploration to instantiating containerized services, and to pass external end-to-end API tests. Our extensive evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks, highlighting a substantial disparity between current model capabilities and the demands of practical backend engineering. Our code is available at https://github.com/OpenMOSS/ABC-Bench.
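To make the evaluation protocol described above concrete, the minimal sketch below illustrates what an external end-to-end API test against an agent-deployed, containerized backend service could look like. This is an assumption-labeled example, not code from ABC-Bench: the base URL, port, `/items` endpoints, and payload fields are hypothetical placeholders chosen for illustration.

```python
# Hypothetical sketch of an external end-to-end API test in the spirit of
# ABC-Bench: the agent is expected to have the service already running
# (e.g. inside a container) before these checks execute.
# The base URL, endpoint paths, and payload below are illustrative only.
import requests

BASE_URL = "http://localhost:8000"  # assumed address of the deployed service


def test_create_and_fetch_item():
    # Create a resource through the public API of the running service.
    created = requests.post(
        f"{BASE_URL}/items", json={"name": "demo"}, timeout=10
    )
    assert created.status_code == 201
    item_id = created.json()["id"]

    # Read the resource back and verify the round trip end to end.
    fetched = requests.get(f"{BASE_URL}/items/{item_id}", timeout=10)
    assert fetched.status_code == 200
    assert fetched.json()["name"] == "demo"
```

Because such tests only exercise the service's external API, they stay agnostic to the language or framework the agent worked in, which matches the benchmark's coverage of 8 languages and 19 frameworks.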