GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities
July 16, 2025
Authors: Diganta Misra, Nizar Islah, Victor May, Brice Rauby, Zihan Wang, Justine Gehring, Antonio Orvieto, Muawiz Chaudhary, Eilif B. Muller, Irina Rish, Samira Ebrahimi Kahou, Massimo Caccia
cs.AI
Abstract
The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent version updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on a specific library version and accompanied by executable unit tests. GitChameleon rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems struggle with this task, with enterprise models achieving baseline success rates of only 48-51%, underscoring the intricacy of the problem. By offering an execution-based benchmark that emphasizes the dynamic nature of code libraries, GitChameleon enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods. We make the dataset and evaluation code publicly available at https://github.com/mrcabbage972/GitChameleonBenchmark.
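
To make the task concrete, below is a minimal, hypothetical sketch of what a version-conditioned code completion problem with an executable unit test could look like. The problem text, function names, and test are our own illustration, not drawn from the GitChameleon dataset; the pandas API change it relies on (DataFrame.append was removed in pandas 2.0) is real.

```python
# Hypothetical problem in the spirit of GitChameleon (actual dataset format
# may differ): "Append a single row to a DataFrame. Target: pandas>=2.0."
# DataFrame.append was removed in pandas 2.0, so a version-aware completion
# must use pd.concat; a model trained on older API usage is likely to fail
# at execution time.

import pandas as pd


def append_row(df: pd.DataFrame, row: dict) -> pd.DataFrame:
    # Version-correct solution for pandas>=2.0: wrap the row in a one-row
    # DataFrame and concatenate, re-indexing the result.
    return pd.concat([df, pd.DataFrame([row])], ignore_index=True)


def test_append_row():
    # Execution-based check: the function is actually run against the
    # installed library version, not just string-matched.
    df = pd.DataFrame({"a": [1, 2]})
    out = append_row(df, {"a": 3})
    assert out["a"].tolist() == [1, 2, 3]


if __name__ == "__main__":
    test_append_row()
    print("ok")
```

Under pandas>=2.0 this test passes, while a completion that calls df.append(row, ignore_index=True) raises AttributeError, which is the kind of version incompatibility the benchmark's execution-based evaluation is designed to surface.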