OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution

May 7, 2025
Authors: Lianghong Guo, Wei Tao, Runhan Jiang, Yanlin Wang, Jiachi Chen, Xilin Liu, Yuchi Ma, Mingzhi Mao, Hongyu Zhang, Zibin Zheng
cs.AI

Abstract

The GitHub issue resolution task aims to automatically resolve issues reported in repositories. With advances in large language models (LLMs), this task has gained increasing attention, and several benchmarks have been proposed to evaluate the issue resolution ability of LLMs. However, existing benchmarks have three main limitations. First, current benchmarks focus on a single programming language, which limits the evaluation of issues from repositories written in different languages. Second, they usually cover a narrow range of domains, which may fail to represent the diversity of real-world issues. Third, existing benchmarks rely solely on the textual information in issue descriptions and overlook multimodal information such as images. In this paper, we propose OmniGIRL, a GitHub Issue ResoLution benchmark that is multilingual, multimodal, and multi-domain. OmniGIRL includes 959 task instances collected from repositories across four programming languages (Python, JavaScript, TypeScript, and Java) and eight different domains. Our evaluation shows that current LLMs perform poorly on OmniGIRL. Notably, the best-performing model, GPT-4o, resolves only 8.6% of the issues. In addition, we find that current LLMs struggle to resolve issues that require understanding images. The best performance on this subset is achieved by Claude-3.5-Sonnet, which resolves only 10.5% of the issues that contain image information. Finally, we analyze the reasons behind the failures of current LLMs on OmniGIRL, providing insights for future improvements.
