OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution
May 7, 2025
Authors: Lianghong Guo, Wei Tao, Runhan Jiang, Yanlin Wang, Jiachi Chen, Xilin Liu, Yuchi Ma, Mingzhi Mao, Hongyu Zhang, Zibin Zheng
cs.AI
Abstract
The GitHub issue resolution task aims to resolve issues reported in
repositories automatically. With advances in large language models (LLMs), this
task has gained increasing attention, and several benchmarks have been proposed to
evaluate the issue resolution ability of LLMs. However, existing benchmarks
have three main limitations. First, current benchmarks focus on a single
programming language, limiting the evaluation of issues from repositories
across different languages. Second, they usually cover a narrow range of
domains, which may fail to represent the diversity of real-world issues. Third,
existing benchmarks rely solely on textual information in issue descriptions,
overlooking multimodal information such as images in issues. In this paper, we
propose OmniGIRL, a GitHub Issue ResoLution benchmark that is multilingual,
multimodal, and multi-domain. OmniGIRL includes 959 task instances, which are
collected from repositories across four programming languages (i.e., Python,
JavaScript, TypeScript, and Java) and eight different domains. Our evaluation
shows that current LLMs achieve limited performance on OmniGIRL. Notably, the
best-performing model, GPT-4o, resolves only 8.6% of the issues. In addition, we
find that current LLMs struggle to resolve issues that require understanding
images. The best performance is achieved by Claude-3.5-Sonnet, which resolves
only 10.5% of the issues with image information. Finally, we analyze the
reasons behind current LLMs' failure on OmniGIRL, providing insights for future
improvements.