Long Code Arena: a Set of Benchmarks for Long-Context Code Models

Abstract

The fields of code and natural language processing are evolving rapidly. In particular, models have become better at processing long context windows: supported context sizes have increased by orders of magnitude over the last few years. However, there is a shortage of benchmarks for code processing that go beyond a single file of context, and the most popular ones are limited to a single method. With this work, we aim to close this gap by introducing Long Code Arena, a suite of six benchmarks for code processing tasks that require project-wide context. These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization. For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions based on popular LLMs to showcase the usage of the dataset and to simplify adoption by other researchers. We publish the benchmark page on HuggingFace Spaces with the leaderboard, links to HuggingFace Hub for all the datasets, and a link to the GitHub repository with baselines: https://huggingface.co/spaces/JetBrains-Research/long-code-arena.
