SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

Abstract

Large language model (LLM)-powered agents have demonstrated strong capabilities in automating software engineering tasks such as static bug fixing, as evidenced by benchmarks like SWE-bench. In the real world, however, the development of mature software typically involves complex requirement changes and long-term feature iterations, a process that static, one-shot repair paradigms fail to capture. To bridge this gap, we propose SWE-CI, the first repository-level benchmark built upon the Continuous Integration loop, aiming to shift the evaluation paradigm for code generation from static, short-term functional correctness toward dynamic, long-term maintainability. The benchmark comprises 100 tasks, each corresponding on average to an evolution history spanning 233 days and 71 consecutive commits in a real-world code repository. SWE-CI requires agents to resolve these tasks systematically through dozens of rounds of analysis and coding iterations, and it provides insight into how well agents can sustain code quality throughout long-term evolution.
