MIGRATION-BENCH: Repository-Level Code Migration Benchmark from Java 8

Abstract

With the rapid advancement of powerful large language models (LLMs) in recent years, a wide range of software engineering tasks can now be addressed using LLMs, significantly enhancing productivity and scalability. Numerous benchmark datasets have been developed to evaluate the coding capabilities of these models, but they primarily focus on problem-solving and issue-resolution tasks. In contrast, we introduce MIGRATION-BENCH, a new coding benchmark with a distinct focus: code migration. MIGRATION-BENCH aims to serve as a comprehensive benchmark for migration from Java 8 to the latest long-term support (LTS) versions (Java 17 and 21). It includes a full dataset and a selected subset, with 5,102 and 300 repositories respectively. The selected subset is a representative sample curated for complexity and difficulty, offering a versatile resource to support research in the field of code migration. Additionally, we provide a comprehensive evaluation framework to facilitate rigorous and standardized assessment of LLMs on this challenging task. We further propose SD-Feedback and demonstrate that LLMs can effectively tackle repository-level code migration to Java 17. On the selected subset with Claude-3.5-Sonnet-v2, SD-Feedback achieves success rates (pass@1) of 62.33% and 27.00% for minimal and maximal migration respectively. The benchmark dataset and source code are available at https://huggingface.co/collections/AmazonScience and https://github.com/amazon-science/self_debug respectively.
