ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development

Abstract

The evolution of Large Language Models (LLMs) into autonomous agents has expanded the scope of AI coding from localized code generation to complex, repository-level, execution-driven problem solving. However, current benchmarks predominantly evaluate code logic in static contexts, neglecting the dynamic, full-process requirements of real-world engineering, particularly in backend development, which demands rigorous environment configuration and service deployment. To address this gap, we introduce ABC-Bench, a benchmark explicitly designed to evaluate agentic backend coding within a realistic, executable workflow. Using a scalable automated pipeline, we curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Unlike previous evaluations, ABC-Bench requires agents to manage the entire development lifecycle, from repository exploration to instantiating containerized services, and to pass external end-to-end API tests. Our extensive evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks, highlighting a substantial gap between current model capabilities and the demands of practical backend engineering. Our code is available at https://github.com/OpenMOSS/ABC-Bench.
