DeNovoSWE: Scaling Long-Horizon Environments for Whole-Repository Generation from Scratch

Abstract

As the capabilities of LLM-based code agents continue to advance, their expected role is expanding beyond localized bug fixing in existing codebases toward architecting and implementing complete software repositories from high-level specifications. However, training agents for such long-horizon software engineering tasks remains difficult because large-scale, verifiable data for whole-repository generation is scarce. In this paper, we introduce DeNovoSWE, a large-scale dataset for whole-repository generation. DeNovoSWE comprises 4,818 high-quality instances, each of which requires generating a complete repository from documentation. The dataset is constructed automatically through a carefully designed sandboxed agentic workflow, which enables scalable curation without human annotation. Its construction follows the "divide and conquer" and critic-repair philosophies. To balance data quality and diversity, we further introduce a difficulty-aware trajectory filtering strategy. Fine-tuning Qwen3-30B-A3B on DeNovoSWE substantially improves long-horizon SWE performance, raising its score on the challenging BeyondSWE-Doc2Repo benchmark from 5.8% to 47.2%.