SWE-bench-java: A GitHub Issue Resolving Benchmark for Java

Abstract

GitHub issue resolving is a critical task in software engineering that has recently gained significant attention in both industry and academia. For this task, SWE-bench was released to evaluate the issue-resolving capabilities of large language models (LLMs), but it has so far focused only on Python. Supporting more programming languages is also important, as there is strong demand for it in industry. As a first step toward multilingual support, we have developed a Java version of SWE-bench, called SWE-bench-java. We have publicly released the dataset, along with the corresponding Docker-based evaluation environment and leaderboard, which will be continuously maintained and updated in the coming months. To verify the reliability of SWE-bench-java, we implement a classic method, SWE-agent, and test several powerful LLMs on it. Developing a high-quality multilingual benchmark is well known to be time-consuming and labor-intensive, so we welcome contributions through pull requests or collaboration to accelerate its iteration and refinement, paving the way for fully automated programming.
