DeNovoSWE: Scaling Long-Horizon Environments for Generating Complete Codebases from Scratch

Abstract

As the capabilities of LLM-based code agents continue to advance, their expected role is expanding beyond localized bug fixing in existing codebases toward architecting and implementing complete software repositories from high-level specifications. However, training agents for such long-horizon software engineering (SWE) tasks remains difficult because large-scale, verifiable whole-repository generation data is scarce. In this paper, we introduce DeNovoSWE, a large-scale dataset for whole-repository generation. DeNovoSWE comprises 4,818 high-quality instances, each of which requires generating a complete repository from documentation. The dataset is constructed automatically through a carefully designed sandboxed agentic workflow, enabling scalable curation without human annotation. Its construction follows the "divide and conquer" and "critic-repair" principles. To balance data quality and diversity, we further introduce a difficulty-aware trajectory filtering strategy. Fine-tuning Qwen3-30B-A3B on DeNovoSWE substantially improves long-horizon SWE performance, raising its score on the challenging BeyondSWE-Doc2Repo benchmark from 5.8% to 47.2%.