MIGRATION-BENCH: Repository-Level Code Migration Benchmark from Java 8
May 14, 2025
Authors: Linbo Liu, Xinle Liu, Qiang Zhou, Lin Chen, Yihan Liu, Hoan Nguyen, Behrooz Omidvar-Tehrani, Xi Shen, Jun Huan, Omer Tripp, Anoop Deoras
cs.AI
Abstract
With the rapid advancement of powerful large language models (LLMs) in recent
years, a wide range of software engineering tasks can now be addressed using
LLMs, significantly enhancing productivity and scalability. Numerous benchmark
datasets have been developed to evaluate the coding capabilities of these
models, but they primarily focus on problem-solving and issue-resolution
tasks. In contrast, we introduce a new coding benchmark, MIGRATION-BENCH, with a
distinct focus: code migration. MIGRATION-BENCH aims to serve as a
comprehensive benchmark for migration from Java 8 to the latest long-term
support (LTS) versions (Java 17, 21). It includes a full dataset and a
selected subset, with 5,102 and 300 repositories respectively. The selected
subset is representative and curated for complexity and difficulty,
offering a versatile resource to support research in the field of code
migration. Additionally, we provide a comprehensive evaluation framework to
facilitate rigorous and standardized assessment of LLMs on this challenging
task. We further propose SD-Feedback and demonstrate that LLMs can effectively
tackle repository-level code migration to Java 17. For the selected subset with
Claude-3.5-Sonnet-v2, SD-Feedback achieves 62.33% and 27.00% success rates
(pass@1) for minimal and maximal migration respectively. The benchmark dataset
and source code are available at:
https://huggingface.co/collections/AmazonScience and
https://github.com/amazon-science/self_debug respectively.