From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence

Abstract

Large language models (LLMs) have fundamentally transformed automated software development by enabling direct translation of natural language descriptions into functional code, driving commercial adoption through tools like GitHub Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). The field has evolved dramatically from rule-based systems to Transformer-based architectures, with success rates on benchmarks like HumanEval rising from single digits to over 95%. In this work, we provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) to code LLMs, systematically examining the complete model life cycle from data curation to post-training, covering advanced prompting paradigms, code pre-training, supervised fine-tuning, reinforcement learning, and autonomous coding agents. We analyze the code capabilities of general-purpose LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder), critically examining their techniques, design decisions, and trade-offs. Further, we articulate the gap between academic research (e.g., benchmarks and tasks) and real-world deployment (e.g., software-related code tasks), including code correctness, security, contextual awareness of large codebases, and integration with development workflows, and map promising research directions to practical needs. Finally, we conduct a series of experiments to provide a comprehensive analysis of code pre-training, supervised fine-tuning, and reinforcement learning, covering scaling laws, framework selection, hyperparameter sensitivity, model architectures, and dataset comparisons.
