MIGRATION-BENCH: Repository-Level Code Migration Benchmark from Java 8
May 14, 2025
Authors: Linbo Liu, Xinle Liu, Qiang Zhou, Lin Chen, Yihan Liu, Hoan Nguyen, Behrooz Omidvar-Tehrani, Xi Shen, Jun Huan, Omer Tripp, Anoop Deoras
cs.AI
Abstract
With the rapid advancement of powerful large language models (LLMs) in recent
years, a wide range of software engineering tasks can now be addressed using
LLMs, significantly enhancing productivity and scalability. Numerous benchmark
datasets have been developed to evaluate the coding capabilities of these
models, but they primarily focus on problem-solving and issue-resolution
tasks. In contrast, we introduce a new coding benchmark, MIGRATION-BENCH, with a
distinct focus: code migration. MIGRATION-BENCH aims to serve as a
comprehensive benchmark for migration from Java 8 to the latest long-term
support (LTS) versions (Java 17, 21). It includes a full dataset and a
selected subset, with 5,102 and 300 repositories respectively.
The selected subset is a representative sample curated for complexity and difficulty,
offering a versatile resource to support research in the field of code
migration. Additionally, we provide a comprehensive evaluation framework to
facilitate rigorous and standardized assessment of LLMs on this challenging
task. We further propose SD-Feedback and demonstrate that LLMs can effectively
tackle repository-level code migration to Java 17. For the selected subset with
Claude-3.5-Sonnet-v2, SD-Feedback achieves 62.33% and 27.00% success rates
(pass@1) for minimal and maximal migration respectively. The benchmark dataset
and source code are available at
https://huggingface.co/collections/AmazonScience and
https://github.com/amazon-science/self_debug respectively.