OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution

Abstract

The GitHub issue resolution task aims to automatically resolve issues reported in repositories. With advances in large language models (LLMs), this task has gained increasing attention, and several benchmarks have been proposed to evaluate the issue resolution ability of LLMs. However, existing benchmarks have three main limitations. First, they focus on a single programming language, which limits the evaluation of issues from repositories written in other languages. Second, they usually cover a narrow range of domains and may therefore fail to represent the diversity of real-world issues. Third, they rely solely on the textual information in issue descriptions, overlooking multimodal information such as images. In this paper, we propose OmniGIRL, a GitHub Issue ResoLution benchmark that is multilingual, multimodal, and multi-domain. OmniGIRL includes 959 task instances collected from repositories across four programming languages (Python, JavaScript, TypeScript, and Java) and eight different domains. Our evaluation shows that current LLMs achieve limited performance on OmniGIRL. Notably, the best-performing model, GPT-4o, resolves only 8.6% of the issues. In addition, we find that current LLMs struggle to resolve issues that require understanding images. On issues that include image information, the best-performing model, Claude-3.5-Sonnet, resolves only 10.5%. Finally, we analyze the reasons behind current LLMs' failures on OmniGIRL, providing insights for future improvements.
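The percentages above are resolve rates: the share of task instances in a given set that a model resolves. As a minimal sketch of how such a rate could be computed overall, per programming language, and on the image-bearing subset, consider the Python below. The `TaskResult` record, its field names, and the toy entries are hypothetical and purely illustrative; they are not the benchmark's actual data format or results.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical per-instance outcome; not the benchmark's real schema."""
    instance_id: str
    language: str      # e.g. "Python", "JavaScript", "TypeScript", "Java"
    has_image: bool    # True if the issue description contains an image
    resolved: bool     # True if the model's patch resolved the issue


def resolve_rate(results: Iterable[TaskResult]) -> float:
    """Return the percentage of resolved instances (0.0 for an empty set)."""
    results = list(results)
    if not results:
        return 0.0
    return 100.0 * sum(r.resolved for r in results) / len(results)


def resolve_rate_by_language(results: Iterable[TaskResult]) -> Dict[str, float]:
    """Group instances by language and compute the resolve rate of each group."""
    groups = defaultdict(list)
    for r in results:
        groups[r.language].append(r)
    return {lang: resolve_rate(rs) for lang, rs in groups.items()}


# Made-up toy outcomes, for illustration only.
toy = [
    TaskResult("toy-1", "Python", False, True),
    TaskResult("toy-2", "Python", False, False),
    TaskResult("toy-3", "JavaScript", True, False),
    TaskResult("toy-4", "TypeScript", True, True),
    TaskResult("toy-5", "Java", False, False),
]

overall = resolve_rate(toy)
image_only = resolve_rate(r for r in toy if r.has_image)
per_language = resolve_rate_by_language(toy)
```

Reporting the image subset separately from the overall rate keeps the two headline figures comparable only within their own denominators, which is why the sketch filters before dividing rather than reweighting a single total.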