Long Code Arena: a Set of Benchmarks for Long-Context Code Models
June 17, 2024
Authors: Egor Bogomolov, Aleksandra Eliseeva, Timur Galimzyanov, Evgeniy Glukhov, Anton Shapkin, Maria Tigina, Yaroslav Golubev, Alexander Kovrigin, Arie van Deursen, Maliheh Izadi, Timofey Bryksin
cs.AI
Abstract
Nowadays, the fields of code and natural language processing are evolving
rapidly. In particular, models are becoming better at processing long context windows:
supported context sizes have increased by orders of magnitude over the last
few years. However, there is a shortage of benchmarks for code processing that
go beyond a single file of context, while the most popular ones are limited to
a single method. With this work, we aim to close this gap by introducing Long
Code Arena, a suite of six benchmarks for code processing tasks that require
project-wide context. These tasks cover different aspects of code processing:
library-based code generation, CI builds repair, project-level code completion,
commit message generation, bug localization, and module summarization. For each
task, we provide a manually verified dataset for testing, an evaluation suite,
and open-source baseline solutions based on popular LLMs to showcase the usage
of the dataset and to simplify adoption by other researchers. We publish the
benchmark page on HuggingFace Spaces with the leaderboard, links to HuggingFace
Hub for all the datasets, and a link to the GitHub repository with baselines:
https://huggingface.co/spaces/JetBrains-Research/long-code-arena.