LONGCODEU: Benchmarking Long-Context Language Models on Long Code Understanding
March 6, 2025
Authors: Jia Li, Xuyuan Guo, Lei Li, Kechi Zhang, Ge Li, Jia Li, Zhengwei Tao, Fang Liu, Chongyang Tao, Yuqi Zhu, Zhi Jin
cs.AI
Abstract
Current advanced long-context language models (LCLMs) offer great potential for real-world software engineering applications. However, progress in this critical domain remains hampered by a fundamental limitation: the absence of a rigorous evaluation framework for long code understanding. To address this gap, we propose LONGCODEU, a long code understanding benchmark that evaluates the long code understanding abilities LCLMs need for practical applications from four aspects (8 tasks): code unit perception, intra-code unit understanding, inter-code unit relation understanding, and long code documentation understanding. We evaluate 9 popular LCLMs (6 general models and 3 code models) on LONGCODEU. Our experimental results reveal key limitations in current LCLMs' long code understanding capabilities. In particular, LCLMs' performance drops dramatically when the long code length is greater than 32K, falling far short of their claimed 128K-1M context windows. Among the four aspects, inter-code unit relation understanding is the most challenging for LCLMs. Our study provides valuable insights for optimizing LCLMs and driving advancements in software engineering.