M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation

Abstract

Repository-level code completion has drawn considerable attention in software engineering, and several benchmark datasets have been introduced. However, existing repository-level code completion benchmarks usually cover only a limited number of languages (fewer than five), so they cannot evaluate the general code intelligence of existing code Large Language Models (LLMs) across different languages. In addition, existing benchmarks usually report only average scores over languages, ignoring fine-grained abilities in different completion scenarios. To facilitate research on code LLMs in multilingual scenarios, we propose M2RC-EVAL, a massively multilingual repository-level code completion benchmark covering 18 programming languages. It provides two types of fine-grained annotations on different completion scenarios, bucket-level and semantic-level, both obtained from the parsed abstract syntax tree. We also curate M2RC-INSTRUCT, a massively multilingual instruction corpus, to improve the repository-level code completion abilities of existing code LLMs. Comprehensive experimental results demonstrate the effectiveness of M2RC-EVAL and M2RC-INSTRUCT.
