SWE-bench-java: A GitHub Issue Resolving Benchmark for Java
August 26, 2024
Authors: Daoguang Zan, Zhirong Huang, Ailun Yu, Shaoxin Lin, Yifan Shi, Wei Liu, Dong Chen, Zongshuai Qi, Hao Yu, Lei Yu, Dezhi Ran, Muhan Zeng, Bo Shen, Pan Bian, Guangtai Liang, Bei Guan, Pengjie Huang, Tao Xie, Yongji Wang, Qianxiang Wang
cs.AI
Abstract
GitHub issue resolving is a critical task in software engineering, recently
gaining significant attention in both industry and academia. Within this task,
SWE-bench has been released to evaluate issue resolving capabilities of large
language models (LLMs), but has so far focused only on the Python version. However,
supporting more programming languages is also important, as there is a strong
demand in industry. As a first step toward multilingual support, we have
developed a Java version of SWE-bench, called SWE-bench-java. We have publicly
released the dataset, along with the corresponding Docker-based evaluation
environment and leaderboard, which will be continuously maintained and updated
in the coming months. To verify the reliability of SWE-bench-java, we implement
a classic method, SWE-agent, and test several powerful LLMs on it. As is well
known, developing a high-quality multilingual benchmark is time-consuming and
labor-intensive, so we welcome contributions through pull requests or
collaboration to accelerate its iteration and refinement, paving the way for
fully automated programming.
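As a concrete starting point, the sketch below shows one way the released dataset could be inspected with the Hugging Face `datasets` library. The dataset identifier used here is a placeholder, and the field names are assumptions based on the original SWE-bench schema; consult the paper's released artifacts for the actual values.

```python
# Minimal sketch (not the authors' code): load SWE-bench-java instances and
# peek at a few fields, assuming the schema mirrors the original SWE-bench.
from datasets import load_dataset

# Hypothetical dataset ID -- replace with the identifier published by the authors.
ds = load_dataset("SWE-bench-java/SWE-bench-java", split="test")

for instance in ds.select(range(3)):
    # Assumed SWE-bench-style fields: repository, instance ID, and issue text.
    print(instance["instance_id"], instance["repo"])
    print(instance["problem_statement"][:200])
```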