CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale

Abstract

Large Language Models (LLMs) perform exceptionally well in software engineering, yet they struggle to adapt to continually evolving code knowledge, particularly the frequent updates of third-party library APIs. This limitation, which stems from static pre-training datasets, often results in non-executable code or implementations with suboptimal safety and efficiency. To address it, this paper introduces CODESYNC, a data engine for identifying outdated code patterns and collecting real-time code knowledge updates from Python third-party libraries. Building on CODESYNC, we develop CODESYNCBENCH, a comprehensive benchmark for assessing LLMs' ability to stay synchronized with code evolution, covering real-world updates for 220 APIs from six Python libraries. The benchmark offers 3,300 test cases across three evaluation tasks and an update-aware instruction tuning dataset of 2,200 training samples. Extensive experiments on 14 state-of-the-art LLMs reveal that they struggle with dynamic code evolution, even with the support of advanced knowledge updating methods (e.g., DPO, ORPO, and SimPO). We believe that our benchmark can offer a strong foundation for developing more effective methods for real-time code knowledge updating. The experimental code and dataset are publicly available at: https://github.com/Lucky-voyage/Code-Sync.
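To make the notion of an API update concrete, here is a minimal sketch of how a signature change between two versions of a function could be detected. It is an illustration only and not the CODESYNC engine itself: `load_v1`, `load_v2`, and `diff_signatures` are hypothetical names invented for this example, and the comparison relies solely on Python's standard `inspect` module.

```python
import inspect


def load_v1(path, encoding="utf-8"):
    """Hypothetical older version of a library function."""


def load_v2(path, *, encoding="utf-8", strict=False):
    """Hypothetical newer version: `encoding` is keyword-only, `strict` is new."""


def diff_signatures(old, new):
    """Compare two callables and report added, removed, and changed parameters."""
    old_params = inspect.signature(old).parameters
    new_params = inspect.signature(new).parameters
    added = [name for name in new_params if name not in old_params]
    removed = [name for name in old_params if name not in new_params]
    # A shared parameter counts as changed if its kind or default differs.
    changed = [
        name
        for name in old_params
        if name in new_params
        and (
            old_params[name].kind != new_params[name].kind
            or old_params[name].default != new_params[name].default
        )
    ]
    return {"added": added, "removed": removed, "changed": changed}


report = diff_signatures(load_v1, load_v2)
print(report)
```

In this toy case, a call such as `load_v1("data.txt", "latin-1")` would fail against the newer signature because `encoding` can no longer be passed positionally. This is the kind of outdated usage pattern that a model trained on static data may keep producing.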

