GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities
July 16, 2025
Authors: Diganta Misra, Nizar Islah, Victor May, Brice Rauby, Zihan Wang, Justine Gehring, Antonio Orvieto, Muawiz Chaudhary, Eilif B. Muller, Irina Rish, Samira Ebrahimi Kahou, Massimo Caccia
cs.AI
Abstract
The rapid evolution of software libraries poses a considerable hurdle for
code generation, necessitating continuous adaptation to frequent version
updates while preserving backward compatibility. While existing code evolution
benchmarks provide valuable insights, they typically lack execution-based
evaluation for generating code compliant with specific library versions. To
address this, we introduce GitChameleon, a novel, meticulously curated dataset
comprising 328 Python code completion problems, each conditioned on specific
library versions and accompanied by executable unit tests. GitChameleon
rigorously evaluates the capacity of contemporary large language models (LLMs),
LLM-powered agents, code assistants, and RAG systems to perform
version-conditioned code generation that demonstrates functional accuracy
through execution. Our extensive evaluations indicate that state-of-the-art
systems encounter significant challenges with this task, with enterprise models
achieving baseline success rates in the 48-51% range, underscoring the
intricacy of the problem. By offering an execution-based benchmark emphasizing
the dynamic nature of code libraries, GitChameleon enables a clearer
understanding of this challenge and helps guide the development of more
adaptable and dependable AI code generation methods. We make the dataset and
evaluation code publicly available at
https://github.com/mrcabbage972/GitChameleonBenchmark.
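
To make the task format concrete, the sketch below illustrates what a version-conditioned completion problem and an execution-based check might look like. This is a minimal, hypothetical illustration: the `Sample` fields and the `run_sample` harness are assumptions for exposition, not the dataset's actual schema or evaluation code, which are available in the linked repository.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass


@dataclass
class Sample:
    """One hypothetical version-conditioned completion problem."""
    library: str    # e.g. "numpy"
    version: str    # the exact library version the solution must target
    prompt: str     # problem statement / starter code shown to the model
    unit_test: str  # executable test appended to the model's completion


def run_sample(sample: Sample, completion: str) -> bool:
    """Run completion + unit test in a subprocess; pass iff exit code is 0.

    A real harness would execute this inside an environment pinned to
    sample.library == sample.version (e.g. a per-version virtualenv).
    """
    program = completion + "\n\n" + sample.unit_test
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, timeout=60)
    return result.returncode == 0


if __name__ == "__main__":
    # Illustrative toy problem only; not drawn from GitChameleon.
    sample = Sample(
        library="numpy",
        version="1.21.0",
        prompt="Write a function `mean(xs)` returning the arithmetic mean.",
        unit_test="assert mean([1, 2, 3]) == 2",
    )
    completion = "def mean(xs):\n    return sum(xs) / len(xs)"
    print("pass" if run_sample(sample, completion) else "fail")
```

Executing each problem against its pinned library version is what distinguishes this style of benchmark from string-matching evaluations: a completion that uses an API removed or renamed in the target version fails at runtime, even if it looks plausible.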