

CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

October 17, 2023
Authors: Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, Bing Xiang
cs.AI

Abstract

Code completion models have made significant progress in recent years, yet current popular evaluation datasets, such as HumanEval and MBPP, predominantly focus on code completion tasks within a single file. This over-simplified setting falls short of representing real-world software development, where repositories span multiple files with numerous cross-file dependencies, and accessing and understanding cross-file context is often required to complete the code correctly. To fill this gap, we propose CrossCodeEval, a diverse and multilingual code completion benchmark that necessitates in-depth cross-file contextual understanding to complete the code accurately. CrossCodeEval is built on a diverse set of real-world, open-sourced, permissively licensed repositories in four popular programming languages: Python, Java, TypeScript, and C#. To create examples that strictly require cross-file context for accurate completion, we propose a straightforward yet efficient static-analysis-based approach to pinpoint the use of cross-file context within the current file. Extensive experiments on state-of-the-art code language models such as CodeGen and StarCoder demonstrate that CrossCodeEval is extremely challenging when the relevant cross-file context is absent, and we see clear improvements when this context is added to the prompt. Even with such improvements, however, the best-performing model still falls well short of ceiling performance, indicating that CrossCodeEval can also assess a model's capability to leverage extensive context for better code completion. Finally, we benchmark various methods of retrieving cross-file context and show that CrossCodeEval can also be used to measure the capability of code retrievers.
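
To make the "static-analysis-based approach to pinpoint the use of cross-file context" concrete, here is a minimal sketch, not the paper's actual implementation: for Python files, it uses the standard `ast` module to find identifiers that are bound by intra-repository imports, i.e. positions whose correct completion plausibly depends on other files in the repository. The heuristic of checking whether a matching `.py` file exists under the repository root is an assumption made for illustration.

```python
# Minimal sketch (not the paper's implementation): flag identifiers in a file
# that are bound by intra-repository imports, i.e. locations whose completion
# likely requires cross-file context.
import ast
from pathlib import Path


def cross_file_usages(repo_root: str, file_path: str) -> list[tuple[int, str]]:
    """Return (line, identifier) pairs whose definitions come from other repo files."""
    source = Path(file_path).read_text(encoding="utf-8")
    tree = ast.parse(source)

    # Collect names bound by imports that point at modules inside repo_root
    # (heuristic: a matching .py file exists in the repository).
    local_imports: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.module:
            module_file = Path(repo_root, *node.module.split(".")).with_suffix(".py")
            if module_file.exists():
                local_imports.update(alias.asname or alias.name for alias in node.names)
        elif isinstance(node, ast.Import):
            for alias in node.names:
                module_file = Path(repo_root, *alias.name.split(".")).with_suffix(".py")
                if module_file.exists():
                    local_imports.add(alias.asname or alias.name.split(".")[0])

    # Any later use of those names marks a position that needs cross-file context.
    usages = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load) and node.id in local_imports:
            usages.append((node.lineno, node.id))
    return sorted(usages)
```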
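The abstract also describes adding retrieved cross-file context to the prompt and benchmarking code retrievers. The sketch below illustrates one such retrieval-augmented setup under stated assumptions: it uses the third-party `rank_bm25` package as a sparse retriever over fixed-size line chunks of the other repository files, and the chunking scheme, query construction, and prompt layout are illustrative choices, not the benchmark's prescribed configuration.

```python
# Minimal sketch of retrieval-augmented prompting, assuming a BM25 retriever
# (rank_bm25) over line-based chunks of the other files in the repository.
from rank_bm25 import BM25Okapi


def build_prompt(in_file_prefix: str, other_files: dict[str, str], k: int = 3) -> str:
    # Split every other file into small line-based chunks that can be retrieved.
    chunks, origins = [], []
    for path, text in other_files.items():
        lines = text.splitlines()
        for i in range(0, len(lines), 10):
            chunks.append("\n".join(lines[i:i + 10]))
            origins.append(path)

    # Score chunks against the code immediately preceding the cursor.
    bm25 = BM25Okapi([c.split() for c in chunks])
    query = " ".join(in_file_prefix.splitlines()[-10:])  # last few lines as the query
    scores = bm25.get_scores(query.split())
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]

    # Prepend the retrieved cross-file snippets as comments, then the in-file prefix.
    context = "\n".join(f"# From {origins[i]}:\n" + chunks[i] for i in top)
    return context + "\n\n" + in_file_prefix
```

The resulting string would be fed to a code language model (e.g., CodeGen or StarCoder) as the completion prompt; swapping BM25 for a dense retriever only changes the scoring step.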