Understanding by Reconstruction: Reversing the Software Development Process for LLM Pretraining
March 11, 2026
Authors: Zhiyuan Zeng, Yichi Zhang, Yong Shan, Kai Hua, Siyuan Fang, Zhaiyu Liu, Jiaheng Liu, Haozhe Wang, Yining Zheng, Ming Ding, Ke Shen, Ge Zhang, Wenhao Huang, Xipeng Qiu
cs.AI
Abstract
While Large Language Models (LLMs) have achieved remarkable success in code generation, they often struggle with the deep, long-horizon reasoning required for complex software engineering. We attribute this limitation to the nature of standard pre-training data: static software repositories represent only the terminal state of an intricate intellectual process, abstracting away the intermediate planning, debugging, and iterative refinement. To bridge this gap, we propose a novel paradigm: understanding via reconstruction. We hypothesize that reverse-engineering the latent agentic trajectories -- the planning, reasoning, and debugging steps -- behind static repositories provides a far richer supervision signal than raw code alone. To operationalize this, we introduce a framework that synthesizes these trajectories using a multi-agent simulation. This process is grounded in the structural realities of the source repositories (e.g., dependency graphs and file hierarchies) to ensure fidelity. Furthermore, to guarantee the logical rigor of the synthetic data, we employ a search-based optimization technique that iteratively refines the Chain-of-Thought (CoT) reasoning to maximize the likelihood of the ground-truth code. Empirical results demonstrate that continued pre-training on these reconstructed trajectories significantly enhances Llama-3-8B's performance across diverse benchmarks, including long-context understanding, coding proficiency, and agentic capabilities.
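The search-based refinement step described above can be sketched in miniature. The abstract does not specify the search algorithm, so the snippet below assumes a simple best-first loop: keep the highest-scoring CoT candidates, expand them with a mutation operator, and score each candidate by (a stand-in for) the log-likelihood of the ground-truth code. The `loglik`, `refine_cot`, and `mutate` names and the toy token-overlap score are illustrative assumptions, not the paper's implementation; in the real framework the score would come from an LM evaluating log P(code | CoT).

```python
def loglik(cot: str, code: str) -> float:
    # Stand-in for an LM scoring log P(code | cot). Toy proxy:
    # reward CoT tokens that also appear in the ground-truth code,
    # with a small length penalty to discourage padding.
    code_tokens = set(code.split())
    hits = sum(1 for t in cot.split() if t in code_tokens)
    return hits - 0.01 * len(cot.split())

def refine_cot(seed_cots, code, mutate, iters=10, keep=2):
    """Best-first search over CoT candidates.

    Each round: sort the pool by log-likelihood of the ground-truth
    code, keep the top `keep` candidates, and expand each with the
    caller-supplied `mutate` operator (e.g., an LM rewrite step).
    """
    pool = list(seed_cots)
    for _ in range(iters):
        pool.sort(key=lambda c: loglik(c, code), reverse=True)
        pool = pool[:keep]
        pool += [mutate(c) for c in pool]
    return max(pool, key=lambda c: loglik(c, code))
```

With a mutation operator that appends relevant detail, each round can only raise (never lower) the best score, since the previous best always survives the pruning step.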