DeNovoSWE: Scaling Long-Horizon Environments for Generating Complete Code Repositories from Scratch

Abstract

As the capabilities of LLM-based code agents continue to advance, their expected role is expanding beyond localized bug fixing in existing codebases toward architecting and implementing complete software repositories from high-level specifications. However, training agents for such long-horizon software engineering (SWE) tasks remains difficult because large-scale, verifiable data for whole-repository generation is scarce. In this paper, we introduce DeNovoSWE, a large-scale dataset for whole-repository generation. DeNovoSWE comprises 4,818 high-quality instances, each of which requires generating a complete repository from documentation. The dataset is constructed automatically through a carefully designed sandboxed agentic workflow, enabling scalable curation without human annotation. Its construction follows a "divide and conquer" and critic-repair philosophy. To balance data quality and diversity, we further introduce a difficulty-aware trajectory filtering strategy. Fine-tuning Qwen3-30B-A3B on DeNovoSWE substantially improves long-horizon SWE performance, raising its score on the challenging BeyondSWE-Doc2Repo benchmark from 5.8% to 47.2%.