CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale

Abstract

Large Language Models (LLMs) have exhibited exceptional performance in software engineering, yet they struggle to adapt to continually evolving code knowledge, particularly the frequent updates of third-party library APIs. This limitation, which stems from static pre-training datasets, often results in non-executable code or in implementations with suboptimal safety and efficiency. To this end, this paper introduces CODESYNC, a data engine for identifying outdated code patterns and collecting real-time code knowledge updates from Python third-party libraries. Building upon CODESYNC, we develop CODESYNCBENCH, a comprehensive benchmark for assessing LLMs' ability to stay synchronized with code evolution, which covers real-world updates for 220 APIs from six Python libraries. Our benchmark offers 3,300 test cases across three evaluation tasks and an update-aware instruction tuning dataset consisting of 2,200 training samples. Extensive experiments on 14 state-of-the-art LLMs reveal that they struggle with dynamic code evolution, even with the support of advanced knowledge updating methods (e.g., DPO, ORPO, and SimPO). We believe that our benchmark can offer a strong foundation for the development of more effective methods for real-time code knowledge updating in the future. The experimental code and dataset are publicly available at: https://github.com/Lucky-voyage/Code-Sync.
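To make the notion of an "outdated code pattern" concrete, the following is a minimal sketch of how an API update might be detected and matched against calling code. It is not the CODESYNC pipeline, whose details are not given in this abstract. The functions `old_resize` and `new_resize`, the parameter names, and the `lib.resize` call are all hypothetical stand-ins for two versions of a third-party API; only the standard-library `inspect` and `ast` modules are used.

```python
import ast
import inspect


# Hypothetical legacy signature of a library function (illustration only).
def old_resize(image, size, interp="bilinear"):
    pass


# Hypothetical updated signature of the same function (illustration only).
def new_resize(image, size, interpolation="bilinear", antialias=True):
    pass


def diff_signatures(old, new):
    """Return (removed, added) parameter names between two callables."""
    old_params = set(inspect.signature(old).parameters)
    new_params = set(inspect.signature(new).parameters)
    return sorted(old_params - new_params), sorted(new_params - old_params)


def find_outdated_calls(source, func_name, removed):
    """Return (line, keyword) pairs for calls to func_name that pass a removed keyword."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        callee = node.func
        # Handle both `lib.resize(...)` (Attribute) and `resize(...)` (Name).
        if isinstance(callee, ast.Attribute):
            name = callee.attr
        else:
            name = getattr(callee, "id", None)
        if name != func_name:
            continue
        for kw in node.keywords:
            if kw.arg in removed:
                hits.append((node.lineno, kw.arg))
    return hits


removed, added = diff_signatures(old_resize, new_resize)

# Code written against the legacy signature, as a model with stale knowledge might generate.
snippet = "import lib\nout = lib.resize(img, (64, 64), interp='nearest')\n"
outdated = find_outdated_calls(snippet, "resize", removed)
```

Here `removed` lists the parameters dropped by the update and `outdated` locates the call that still uses one of them. A real system would also need to handle positional arguments, renamed functions, and behavioral changes that leave the signature intact, none of which this sketch attempts.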

