OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution

Abstract

The GitHub issue resolution task aims to automatically resolve issues reported in repositories. With advances in large language models (LLMs), this task has gained increasing attention, and several benchmarks have been proposed to evaluate the issue resolution ability of LLMs. However, existing benchmarks have three main limitations. First, they focus on a single programming language, which limits the evaluation of issues from repositories written in different languages. Second, they usually cover a narrow range of domains, which may fail to represent the diversity of real-world issues. Third, they rely solely on the textual information in issue descriptions, overlooking multimodal information such as images in issues. In this paper, we propose OmniGIRL, a GitHub Issue ResoLution benchmark that is multilingual, multimodal, and multi-domain. OmniGIRL includes 959 task instances collected from repositories across four programming languages (Python, JavaScript, TypeScript, and Java) and eight different domains. Our evaluation shows that current LLMs achieve limited performance on OmniGIRL. Notably, the best-performing model, GPT-4o, resolves only 8.6% of the issues. In addition, we find that current LLMs struggle to resolve issues that require understanding images. On issues containing image information, the best-performing model, Claude-3.5-Sonnet, resolves only 10.5%. Finally, we analyze the reasons behind current LLMs' failures on OmniGIRL, providing insights for future improvements.
