Long Code Arena: a Set of Benchmarks for Long-Context Code Models

Abstract

The fields of code and natural language processing are evolving rapidly. In particular, models are becoming better at processing long context windows: supported context sizes have increased by orders of magnitude over the last few years. However, there is a shortage of benchmarks for code processing that go beyond a single file of context, and the most popular ones are limited to a single method. With this work, we aim to close this gap by introducing Long Code Arena, a suite of six benchmarks for code processing tasks that require project-wide context. These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization. For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions based on popular LLMs to showcase the usage of the dataset and to simplify adoption by other researchers. We publish the benchmark page on HuggingFace Spaces with the leaderboard, links to HuggingFace Hub for all the datasets, and a link to the GitHub repository with baselines: https://huggingface.co/spaces/JetBrains-Research/long-code-arena.

