ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development

Abstract

The evolution of Large Language Models (LLMs) into autonomous agents has expanded the scope of AI coding from localized code generation to complex, repository-level, execution-driven problem solving. However, current benchmarks predominantly evaluate code logic in static contexts, neglecting the dynamic, full-process requirements of real-world engineering, particularly in backend development, which demands rigorous environment configuration and service deployment. To address this gap, we introduce ABC-Bench, a benchmark explicitly designed to evaluate agentic backend coding within a realistic, executable workflow. Using a scalable automated pipeline, we curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Unlike previous evaluations, ABC-Bench requires agents to manage the entire development lifecycle, from repository exploration to instantiating containerized services, and to pass external end-to-end API tests. Our extensive evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks, highlighting a substantial gap between current model capabilities and the demands of practical backend engineering. Our code is available at https://github.com/OpenMOSS/ABC-Bench.

