LONGCODEU: Benchmarking Long-Context Language Models on Long Code Understanding

Abstract

Current advanced long-context language models (LCLMs) offer great potential for real-world software engineering applications. However, progress in this critical domain remains hampered by a fundamental limitation: the absence of a rigorous evaluation framework for long code understanding. To bridge this gap, we propose LONGCODEU, a long code understanding benchmark that evaluates the long code understanding ability LCLMs need for practical applications across four aspects (8 tasks): code unit perception, intra-code unit understanding, inter-code unit relation understanding, and long code documentation understanding. We evaluate 9 popular LCLMs on LONGCODEU (6 general models and 3 code models). Our experimental results reveal key limitations in current LCLMs' capabilities for long code understanding. In particular, the performance of LCLMs drops dramatically when the code length exceeds 32K, falling far short of their claimed 128K-1M context windows. Among the four aspects, inter-code unit relation understanding is the most challenging for LCLMs. Our study provides valuable insights for optimizing LCLMs and driving advancements in software engineering.
