Understanding by Reconstruction: Reversing the Software Development Process for LLM Pretraining
March 11, 2026
Authors: Zhiyuan Zeng, Yichi Zhang, Yong Shan, Kai Hua, Siyuan Fang, Zhaiyu Liu, Jiaheng Liu, Haozhe Wang, Yining Zheng, Ming Ding, Ke Shen, Ge Zhang, Wenhao Huang, Xipeng Qiu
cs.AI
Abstract
While Large Language Models (LLMs) have achieved remarkable success in code generation, they often struggle with the deep, long-horizon reasoning required for complex software engineering. We attribute this limitation to the nature of standard pre-training data: static software repositories represent only the terminal state of an intricate intellectual process, abstracting away the intermediate planning, debugging, and iterative refinement. To bridge this gap, we propose a novel paradigm: understanding by reconstruction. We hypothesize that reverse-engineering the latent agentic trajectories -- the planning, reasoning, and debugging steps -- behind static repositories provides a far richer supervision signal than raw code alone. To operationalize this, we introduce a framework that synthesizes these trajectories using a multi-agent simulation. This process is grounded in the structural realities of the source repositories (e.g., dependency graphs and file hierarchies) to ensure fidelity. Furthermore, to guarantee the logical rigor of the synthetic data, we employ a search-based optimization technique that iteratively refines the Chain-of-Thought (CoT) reasoning to maximize the likelihood of the ground-truth code. Empirical results demonstrate that continued pre-training on these reconstructed trajectories significantly enhances Llama-3-8B's performance across diverse benchmarks, including long-context understanding, coding proficiency, and agentic capabilities.
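The search-based CoT refinement described above can be illustrated with a toy sketch. The abstract does not specify the concrete algorithm, so everything below is a hypothetical illustration: the greedy loop, the `loglik` function (a stand-in for a language model's log p(code | CoT)), and the word-level proposal vocabulary are all assumptions, meant only to convey the idea of iteratively editing a chain of thought so as to maximize the likelihood of the ground-truth code.

```python
def loglik(code, cot):
    """Hypothetical surrogate for log p(code | cot). In the real framework this
    would be a language model's log-likelihood of the ground-truth code given
    the chain-of-thought; here we simply count code identifiers the CoT mentions."""
    for ch in "():,":
        code = code.replace(ch, " ")
    return float(len(set(code.split()) & set(cot.split())))

def refine_cot(code, seed_cot, vocab, passes=2):
    """Greedy search: append any candidate word that raises the score,
    i.e. iteratively edit the CoT to increase p(ground-truth code | CoT)."""
    best, best_score = seed_cot, loglik(code, seed_cot)
    for _ in range(passes):
        for word in vocab:
            cand = best + " " + word
            score = loglik(code, cand)
            if score > best_score:
                best, best_score = cand, score
    return best, best_score

# Toy "ground-truth code" and candidate CoT edits (all hypothetical).
ground_truth = "def add(a, b): return a + b"
vocab = ["define", "add", "a", "b", "return", "sum", "loop"]
cot, score = refine_cot(ground_truth, "plan:", vocab)
print(cot, score)  # plan: add a b return 4.0
```

A real instantiation would replace `loglik` with teacher-forced scoring of the repository code under the pretraining model, and the word-append proposals with LLM-generated rewrites of whole reasoning steps; the accept-if-improved loop is the part the abstract's "search-based optimization" refers to.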