SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

Abstract

Large language model (LLM)-powered agents have demonstrated strong capabilities in automating software engineering tasks such as static bug fixing, as evidenced by benchmarks like SWE-bench. In the real world, however, mature software is typically developed through complex requirement changes and long-term feature iterations, a process that static, one-shot repair paradigms fail to capture. To bridge this gap, we propose SWE-CI, the first repository-level benchmark built upon the Continuous Integration (CI) loop. It aims to shift the evaluation paradigm for code generation from static, short-term functional correctness toward dynamic, long-term maintainability. The benchmark comprises 100 tasks, each corresponding on average to an evolution history spanning 233 days and 71 consecutive commits in a real-world code repository. SWE-CI requires agents to systematically resolve these tasks through dozens of rounds of analysis and coding iterations, and it provides valuable insights into how well agents can sustain code quality throughout long-term evolution.

