

SWE-bench-java: A GitHub Issue Resolving Benchmark for Java

August 26, 2024
Authors: Daoguang Zan, Zhirong Huang, Ailun Yu, Shaoxin Lin, Yifan Shi, Wei Liu, Dong Chen, Zongshuai Qi, Hao Yu, Lei Yu, Dezhi Ran, Muhan Zeng, Bo Shen, Pan Bian, Guangtai Liang, Bei Guan, Pengjie Huang, Tao Xie, Yongji Wang, Qianxiang Wang
cs.AI

Abstract

GitHub issue resolving is a critical task in software engineering, and it has recently gained significant attention in both industry and academia. Within this task, SWE-bench has been released to evaluate the issue-resolving capabilities of large language models (LLMs), but it has so far focused only on Python. However, supporting more programming languages is also important, as there is strong demand in industry. As a first step toward multilingual support, we have developed a Java version of SWE-bench, called SWE-bench-java. We have publicly released the dataset, along with the corresponding Docker-based evaluation environment and leaderboard, which will be continuously maintained and updated in the coming months. To verify the reliability of SWE-bench-java, we implement a classic method, SWE-agent, and test several powerful LLMs on it. As is well known, developing a high-quality multilingual benchmark is time-consuming and labor-intensive, so we welcome contributions through pull requests or collaboration to accelerate its iteration and refinement, paving the way for fully automated programming.
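To make the dataset description concrete, here is a minimal sketch of what a SWE-bench-style task instance looks like and a small validator for it. The field names follow the original (Python) SWE-bench schema; the assumption that SWE-bench-java uses the same schema, and the example repository and values shown, are illustrative, not taken from the paper.

```python
# Hedged sketch: a minimal validator for a SWE-bench-style task instance.
# Field names follow the original SWE-bench dataset schema; whether
# SWE-bench-java uses an identical schema is an assumption.

REQUIRED_FIELDS = {
    "repo",               # source repository, e.g. "apache/dubbo" (hypothetical)
    "instance_id",        # unique id, typically "<org>__<repo>-<issue-number>"
    "base_commit",        # commit that the model's patch is applied against
    "problem_statement",  # the GitHub issue text the model must resolve
    "patch",              # gold patch that resolves the issue
    "test_patch",         # tests that pass only once the issue is resolved
}

def validate_instance(instance: dict) -> list:
    """Return the sorted list of required fields that are missing or empty."""
    return sorted(
        field for field in REQUIRED_FIELDS
        if not str(instance.get(field, "")).strip()
    )

# Illustrative instance (all values are made up for the sketch).
example = {
    "repo": "apache/dubbo",
    "instance_id": "apache__dubbo-1234",
    "base_commit": "abc1234",
    "problem_statement": "NullPointerException when resolving service URLs",
    "patch": "diff --git a/A.java b/A.java",
    "test_patch": "diff --git a/ATest.java b/ATest.java",
}

print(validate_instance(example))  # prints [] when all fields are present
```

In a Docker-based harness like the one the paper describes, each such instance would be checked out at `base_commit`, the model-generated patch applied, and `test_patch` run to decide resolution; the validator above only illustrates the instance format itself.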

