OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution
May 7, 2025
Authors: Lianghong Guo, Wei Tao, Runhan Jiang, Yanlin Wang, Jiachi Chen, Xilin Liu, Yuchi Ma, Mingzhi Mao, Hongyu Zhang, Zibin Zheng
cs.AI
Abstract
The GitHub issue resolution task aims to resolve issues reported in
repositories automatically. With advances in large language models (LLMs), this
task has gained increasing attention, and several benchmarks have been proposed to
evaluate the issue resolution ability of LLMs. However, existing benchmarks
have three main limitations. First, current benchmarks focus on a single
programming language, limiting the evaluation of issues from repositories
across different languages. Second, they usually cover a narrow range of
domains, which may fail to represent the diversity of real-world issues. Third,
existing benchmarks rely solely on textual information in issue descriptions,
overlooking multimodal information such as images in issues. In this paper, we
propose OmniGIRL, a GitHub Issue ResoLution benchmark that is multilingual,
multimodal, and multi-domain. OmniGIRL includes 959 task instances, which are
collected from repositories across four programming languages (i.e., Python,
JavaScript, TypeScript, and Java) and eight different domains. Our evaluation
shows that current LLMs perform poorly on OmniGIRL. Notably, the
best-performing model, GPT-4o, resolves only 8.6% of the issues. We also
find that current LLMs struggle to resolve issues that require understanding
images. The best performance on these issues is achieved by Claude-3.5-Sonnet, which resolves
only 10.5% of the issues with image information. Finally, we analyze the
reasons behind current LLMs' failure on OmniGIRL, providing insights for future
improvements.