OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution

May 7, 2025
Authors: Lianghong Guo, Wei Tao, Runhan Jiang, Yanlin Wang, Jiachi Chen, Xilin Liu, Yuchi Ma, Mingzhi Mao, Hongyu Zhang, Zibin Zheng
cs.AI

Abstract

The GitHub issue resolution task aims to automatically resolve issues reported in repositories. With advances in large language models (LLMs), this task has gained increasing attention, and several benchmarks have been proposed to evaluate the issue resolution ability of LLMs. However, existing benchmarks have three main limitations. First, they focus on a single programming language, which limits the evaluation of issues from repositories written in other languages. Second, they usually cover a narrow range of domains and may fail to represent the diversity of real-world issues. Third, they rely solely on the textual information in issue descriptions, overlooking multimodal information such as images attached to issues. In this paper, we propose OmniGIRL, a GitHub Issue ResoLution benchmark that is multilingual, multimodal, and multi-domain. OmniGIRL includes 959 task instances collected from repositories across four programming languages (Python, JavaScript, TypeScript, and Java) and eight different domains. Our evaluation shows that current LLMs perform poorly on OmniGIRL. Notably, the best-performing model, GPT-4o, resolves only 8.6% of the issues. In addition, we find that current LLMs struggle to resolve issues that require understanding images. On the issues that contain image information, the best performance is achieved by Claude-3.5-Sonnet, which resolves only 10.5% of them. Finally, we analyze the reasons behind current LLMs' failures on OmniGIRL, providing insights for future improvements.
