MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Abstract

Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEvalPro, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEvalPro comprises 2,138 question triplets, totaling 6,414 distinct questions. Two-thirds of these questions are manually labeled by human experts, while the rest are sourced from existing benchmarks (MMMU, ScienceQA, and MathVista). Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more challenging (the best LMM lags behind human performance by 31.73%, compared to an average gap of 8.03% in previous benchmarks) and more trustworthy (the best LLM trails the best LMM by 23.09%, whereas the gap for previous benchmarks is just 14.64%). Our in-depth analysis explains the reason for the large performance gap and justifies the trustworthiness of evaluation, underscoring its significant potential for advancing future research.
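
The abstract does not spell out the "more rigorous metrics". As an illustration only, the sketch below assumes a triplet-level score in which a triplet counts as correct only when the original, perception, and knowledge anchor questions are all answered correctly, and contrasts it with plain per-question accuracy. The function names, dictionary keys, and sample results are hypothetical and are not taken from the benchmark's code or data.

```python
from typing import Dict, List

# Hypothetical keys for the three questions in one triplet.
QUESTION_KINDS = ("original", "perception", "knowledge")


def per_question_accuracy(results: List[Dict[str, bool]]) -> float:
    """Fraction of individual questions answered correctly."""
    total = len(results) * len(QUESTION_KINDS)
    if total == 0:
        return 0.0
    correct = sum(r[k] for r in results for k in QUESTION_KINDS)
    return correct / total


def triplet_accuracy(results: List[Dict[str, bool]]) -> float:
    """Fraction of triplets with all three questions correct (assumed metric)."""
    if not results:
        return 0.0
    passed = sum(all(r[k] for k in QUESTION_KINDS) for r in results)
    return passed / len(results)


# Made-up results for four triplets, for illustration only.
results = [
    {"original": True, "perception": True, "knowledge": True},
    {"original": True, "perception": False, "knowledge": True},
    {"original": True, "perception": True, "knowledge": False},
    {"original": False, "perception": False, "knowledge": False},
]

print(f"per-question accuracy: {per_question_accuracy(results):.3f}")
print(f"triplet accuracy:      {triplet_accuracy(results):.3f}")

# Size check from the abstract: 2,138 triplets of 3 questions each.
print(2138 * 3)
```

Under this assumption, a model that answers the original question correctly without handling the perception or knowledge anchor question earns no credit for that triplet, which is one way an evaluation could guard against Type-I errors.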

