From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
November 23, 2025
Authors: Jian Yang, Xianglong Liu, Weifeng Lv, Ken Deng, Shawn Guo, Lin Jing, Yizhi Li, Shark Liu, Xianzhen Luo, Yuyu Luo, Changzai Pan, Ensheng Shi, Yingshui Tan, Renshuai Tao, Jiajun Wu, Xianjie Wu, Zhenhe Wu, Daoguang Zan, Chenchen Zhang, Wei Zhang, He Zhu, Terry Yue Zhuo, Kerui Cao, Xianfu Cheng, Jun Dong, Shengjie Fang, Zhiwei Fei, Xiangyuan Guan, Qipeng Guo, Zhiguang Han, Joseph James, Tianqi Luo, Renyuan Li, Yuhang Li, Yiming Liang, Congnan Liu, Jiaheng Liu, Qian Liu, Ruitong Liu, Tyler Loakman, Xiangxin Meng, Chuang Peng, Tianhao Peng, Jiajun Shi, Mingjie Tang, Boyang Wang, Haowen Wang, Yunli Wang, Fanglin Xu, Zihan Xu, Fei Yuan, Ge Zhang, Jiayi Zhang, Xinhao Zhang, Wangchunshu Zhou, Hualei Zhu, King Zhu, Brown Dai, Aishan Liu, Zhoujun Li, Chenghua Lin, Tianyu Liu, Chao Peng, Kai Shen, Libo Qin, Shuangyong Song, Zizheng Zhan, Jiajun Zhang, Jie Zhang, Zhaoxiang Zhang, Bo Zheng
cs.AI
Abstract
Large language models (LLMs) have fundamentally transformed automated software development by enabling the direct translation of natural language descriptions into functional code, driving commercial adoption through tools such as GitHub Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). While the field has evolved dramatically from rule-based systems to Transformer-based architectures, with success rates on benchmarks like HumanEval rising from single digits to over 95%, significant challenges remain. In this work, we provide a comprehensive synthesis and practical guide (including a series of analytic and probing experiments) on code LLMs, systematically examining the complete model life cycle from data curation to post-training, covering advanced prompting paradigms, code pre-training, supervised fine-tuning, reinforcement learning, and autonomous coding agents. We analyze the code capabilities of general-purpose LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder), critically examining their techniques, design decisions, and trade-offs. Further, we articulate the research-practice gap between academic research (e.g., benchmarks and tasks) and real-world deployment (e.g., software-related code tasks), including code correctness, security, contextual awareness of large codebases, and integration with development workflows, and map promising research directions onto practical needs. Finally, we conduct a series of experiments to provide a comprehensive analysis of code pre-training, supervised fine-tuning, and reinforcement learning, covering scaling laws, framework selection, hyperparameter sensitivity, model architectures, and dataset comparisons.