

daVinci-Dev: Agent-native Mid-training for Software Engineering

January 26, 2026
作者: Ji Zeng, Dayuan Fu, Tiantian Mi, Yumin Zhuang, Yaxing Huang, Xuefeng Li, Lyumanshan Ye, Muhang Xie, Qishuo Hua, Zhen Huang, Mohan Jiang, Hanning Wang, Jifan Lin, Yang Xiao, Jie Sun, Yunze Wu, Pengfei Liu
cs.AI

Abstract

Recently, the frontier of Large Language Model (LLM) capabilities has shifted from single-turn code generation to agentic software engineering, a paradigm where models autonomously navigate, edit, and test complex repositories. While post-training methods have become the de facto approach for code agents, **agentic mid-training**, i.e., mid-training (MT) on large-scale data that mirrors authentic agentic workflows, remains critically underexplored due to its substantial resource requirements, despite offering a more scalable path to instilling foundational agentic behaviors than relying solely on expensive reinforcement learning. A central challenge in realizing effective agentic mid-training is the distribution mismatch between static training data and the dynamic, feedback-rich environment of real development. To address this, we present a systematic study of agentic mid-training, establishing both the data synthesis principles and the training methodology for effective agent development at scale. Central to our approach is **agent-native data**: supervision comprising two complementary types of trajectories: **contextually-native trajectories**, which preserve the complete information flow an agent experiences, offering broad coverage and diversity; and **environmentally-native trajectories**, collected from executable repositories where observations stem from actual tool invocations and test executions, providing depth and interaction authenticity. We verify the model's agentic capabilities on `SWE-Bench Verified`. Under two post-training settings with an aligned base model and agentic scaffold, our approach outperforms the previous open software-engineering mid-training recipe `Kimi-Dev` while using fewer than half the mid-training tokens (73.1B). Beyond this relative advantage, our best-performing 32B and 72B models achieve **56.1%** and **58.5%** resolution rates, respectively, which are ...
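For intuition, the sketch below shows one hypothetical way the two trajectory types described in the abstract could be represented as mid-training records. It is not taken from the paper; the schema, field names, and tool names are illustrative assumptions only.

```python
# Illustrative only: a hypothetical record schema for agent-native trajectories.
# Nothing here is drawn from the daVinci-Dev implementation.
from dataclasses import dataclass, field
from typing import Literal, Optional


@dataclass
class Step:
    """One agent turn: a model message or the observation it received."""
    role: Literal["assistant", "tool"]
    content: str                       # model output, or tool/test observation text
    tool_name: Optional[str] = None    # e.g. "edit_file", "run_tests" (hypothetical names)


@dataclass
class Trajectory:
    """A single mid-training example mirroring an agentic workflow."""
    kind: Literal["contextually_native", "environmentally_native"]
    repo: str                          # repository identifier
    task: str                          # issue or task description
    steps: list[Step] = field(default_factory=list)

    def is_execution_grounded(self) -> bool:
        # Environmentally-native trajectories come from executable repositories,
        # so their observations reflect real tool calls and test runs.
        return self.kind == "environmentally_native"


# Example usage with placeholder content:
traj = Trajectory(
    kind="environmentally_native",
    repo="example/project",
    task="Fix failing unit test in parser module",
    steps=[
        Step(role="assistant", content="Run the test suite to reproduce the failure.",
             tool_name="run_tests"),
        Step(role="tool", content="1 failed, 41 passed"),
    ],
)
print(traj.is_execution_grounded())  # True
```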