SWE-bench-java: A GitHub Issue Resolving Benchmark for Java
August 26, 2024
Authors: Daoguang Zan, Zhirong Huang, Ailun Yu, Shaoxin Lin, Yifan Shi, Wei Liu, Dong Chen, Zongshuai Qi, Hao Yu, Lei Yu, Dezhi Ran, Muhan Zeng, Bo Shen, Pan Bian, Guangtai Liang, Bei Guan, Pengjie Huang, Tao Xie, Yongji Wang, Qianxiang Wang
cs.AI
Abstract
GitHub issue resolving is a critical task in software engineering, recently
gaining significant attention in both industry and academia. Within this task,
SWE-bench has been released to evaluate issue resolving capabilities of large
language models (LLMs), but has so far focused only on the Python version. However,
supporting more programming languages is also important, as there is a strong
demand in industry. As a first step toward multilingual support, we have
developed a Java version of SWE-bench, called SWE-bench-java. We have publicly
released the dataset, along with the corresponding Docker-based evaluation
environment and leaderboard, which will be continuously maintained and updated
in the coming months. To verify the reliability of SWE-bench-java, we implement
a classic method, SWE-agent, and test several powerful LLMs on it. As is well
known, developing a high-quality multilingual benchmark is time-consuming and
labor-intensive, so we welcome contributions through pull requests or
collaboration to accelerate its iteration and refinement, paving the way for
fully automated programming.
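As a concrete starting point, the sketch below shows one way the released dataset could be inspected with the Hugging Face `datasets` library. The dataset identifier used here is a placeholder, and the field names are assumptions based on the original SWE-bench schema; consult the paper's released artifacts for the actual values.

```python
# Minimal sketch (not the authors' code): load SWE-bench-java instances and
# peek at a few fields, assuming the schema mirrors the original SWE-bench.
from datasets import load_dataset

# Hypothetical dataset ID -- replace with the identifier published by the authors.
ds = load_dataset("SWE-bench-java/SWE-bench-java", split="test")

for instance in ds.select(range(3)):
    # Assumed SWE-bench-style fields: repository, instance ID, and issue text.
    print(instance["instance_id"], instance["repo"])
    print(instance["problem_statement"][:200])
```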