Long Code Arena: a Set of Benchmarks for Long-Context Code Models

Abstract

The fields of code and natural language processing are evolving rapidly. In particular, models have become better at processing long context windows: supported context sizes have increased by orders of magnitude over the last few years. However, there is a shortage of benchmarks for code processing that go beyond a single file of context, and the most popular ones are limited to a single method. With this work, we aim to close this gap by introducing Long Code Arena, a suite of six benchmarks for code processing tasks that require project-wide context. These tasks cover different aspects of code processing: library-based code generation, CI build repair, project-level code completion, commit message generation, bug localization, and module summarization. For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions based on popular LLMs to showcase the usage of the dataset and to simplify adoption by other researchers. We publish the benchmark page on HuggingFace Spaces with the leaderboard, links to HuggingFace Hub for all the datasets, and a link to the GitHub repository with baselines: https://huggingface.co/spaces/JetBrains-Research/long-code-arena.
