A Survey on Language Models for Code

Abstract

In this work, we systematically review recent advances in code processing with language models, covering 50+ models, 30+ evaluation tasks, and 500 related works. We divide code processing models into general language models, represented by the GPT family, and specialized models that are pretrained specifically on code, often with tailored objectives. We discuss the relations and differences between these models, and highlight the historical transition of code modeling from statistical models and RNNs to pretrained Transformers and LLMs, which is exactly the course taken by NLP. We also discuss code-specific features such as abstract syntax trees (AST), control flow graphs (CFG), and unit tests, along with their application in training code language models, and identify key challenges and potential future directions in this domain. We keep the survey open and updated in a GitHub repository at https://github.com/codefuse-ai/Awesome-Code-LLM.
