MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal Large Language Models
June 5, 2025
Authors: Gio Paik, Geewook Kim, Jinbae Im
cs.AI
Abstract
This paper introduces MMRefine, a MultiModal Refinement benchmark designed to evaluate the error refinement capabilities of Multimodal Large Language Models (MLLMs). As the emphasis shifts toward enhancing reasoning during inference, MMRefine provides a framework that goes beyond comparing final accuracy before and after refinement, evaluating MLLMs' ability to detect and correct errors across six distinct scenarios. Furthermore, the benchmark analyzes refinement performance by categorizing errors into six error types. Experiments with various open-source and closed-source MLLMs reveal bottlenecks and factors impeding refinement performance, highlighting directions for improving effective reasoning enhancement. Our code and dataset are publicly available at https://github.com/naver-ai/MMRefine.
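As a rough illustration of the distinction the abstract draws between outcome-only evaluation and scenario-level evaluation, the sketch below contrasts a simple before/after accuracy delta with a per-scenario breakdown. This is hypothetical code, not taken from the MMRefine repository; the record fields and scenario names are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical record format (assumption): each item notes whether the initial
# solution was correct, whether the refined solution is correct, and which
# refinement scenario the attempt falls under.
records = [
    {"correct_before": False, "correct_after": True,  "scenario": "error_detected_and_fixed"},
    {"correct_before": False, "correct_after": False, "scenario": "error_detected_not_fixed"},
    {"correct_before": True,  "correct_after": False, "scenario": "correct_solution_corrupted"},
]

def accuracy(items, key):
    """Fraction of items marked correct under the given key."""
    return sum(item[key] for item in items) / len(items)

# Outcome-only view: a single accuracy delta before vs. after refinement.
delta = accuracy(records, "correct_after") - accuracy(records, "correct_before")
print(f"accuracy gain from refinement: {delta:+.2f}")

# Scenario-level view: how often each refinement behavior occurs, which is the
# kind of finer-grained breakdown the benchmark advocates.
print(Counter(record["scenario"] for record in records))
```

The point of the contrast is that a flat accuracy delta can hide cases where a model corrupts an already correct solution or spots an error without repairing it; the per-scenario counts expose those failure modes directly.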