
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

October 17, 2023
作者: Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, Bing Xiang
cs.AI

Abstract
Code completion models have made significant progress in recent years, yet current popular evaluation datasets, such as HumanEval and MBPP, predominantly focus on code completion tasks within a single file. This over-simplified setting falls short of representing the real-world software development scenario, where repositories span multiple files with numerous cross-file dependencies, and accessing and understanding cross-file context is often required to complete the code correctly. To fill this gap, we propose CrossCodeEval, a diverse and multilingual code completion benchmark that necessitates an in-depth cross-file contextual understanding to complete the code accurately. CrossCodeEval is built on a diverse set of real-world, open-sourced, permissively licensed repositories in four popular programming languages: Python, Java, TypeScript, and C#. To create examples that strictly require cross-file context for accurate completion, we propose a straightforward yet efficient static-analysis-based approach to pinpoint the use of cross-file context within the current file. Extensive experiments on state-of-the-art code language models like CodeGen and StarCoder demonstrate that CrossCodeEval is extremely challenging when the relevant cross-file context is absent, and we see clear improvements when this context is added to the prompt. However, despite such improvements, the ceiling of performance remains unattained even with the highest-performing model, indicating that CrossCodeEval can also assess a model's capability to leverage extensive context for better code completion. Finally, we benchmark various methods for retrieving cross-file context and show that CrossCodeEval can also be used to measure the capability of code retrievers.
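The abstract does not spell out the static-analysis procedure, but the core idea of checking whether a completion target actually uses cross-file context can be illustrated with a minimal sketch. The snippet below is a hypothetical, simplified Python illustration (not the paper's pipeline): it parses a file, collects the names bound by import statements, and flags a target line as cross-file dependent if that line references one of those imported names.

```python
# Hypothetical sketch only -- not the CrossCodeEval pipeline.
# Flags a completion target line as "cross-file" if it uses an imported name.
import ast


def imported_names(source: str) -> set[str]:
    """Collect names bound by import statements in a Python file."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                names.add(alias.asname or alias.name)
    return names


def uses_cross_file_context(source: str, target_line: int) -> bool:
    """Return True if the statement on `target_line` references an imported name."""
    imports = imported_names(source)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and getattr(node, "lineno", None) == target_line:
            if node.id in imports:
                return True
    return False


if __name__ == "__main__":
    code = (
        "from utils.math_ops import add\n"
        "\n"
        "def total(xs):\n"
        "    return add(xs[0], xs[1])\n"
    )
    # Line 4 calls `add`, which is defined in another file, so this prints True.
    print(uses_cross_file_context(code, target_line=4))
```

A repository-level analysis along these lines would additionally have to check that the imported symbol resolves to another file within the same repository rather than to a third-party library, and equivalent tooling would be needed for Java, TypeScript, and C#.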