AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

Abstract

Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks that humans find trivial: 1) determining which of two sounds is louder, and 2) determining which of two sounds has a higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether these MLLMs can truly understand audio-visual information. The benchmark comprises 4,555 carefully crafted problems, each incorporating text, visual, and audio components. To infer the answers successfully, models must effectively leverage clues from both the visual and the audio inputs. To ensure precise and objective evaluation of MLLM responses, we structure the questions as multiple-choice, eliminating the need for human evaluation or LLM-assisted assessment. We benchmark a series of closed-source and open-source models and summarize the observations. By revealing the limitations of current models, we aim to provide useful insights for future dataset collection and model development.
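Because every question is multiple-choice, a response can be scored by comparing the chosen option letter with the ground truth, with no human or LLM judge involved. The following is a minimal sketch of that idea, not the benchmark's actual scoring code: the function names, the four-option `A` to `D` format, and the strict rule that a response must consist of the option letter alone are all assumptions made for illustration.

```python
import re
from typing import List, Optional

# Hypothetical helper: the benchmark's real answer-parsing rules are not
# described in the abstract, so this strict format is an assumption.
def extract_choice(response: str) -> Optional[str]:
    """Return the option letter if the response is just a letter such as
    'B', '(b)' or 'B.'; return None for anything else."""
    match = re.fullmatch(r"\(?([A-D])\)?[.:]?", response.strip().upper())
    return match.group(1) if match else None


def accuracy(responses: List[str], answers: List[str]) -> float:
    """Fraction of responses whose extracted letter equals the ground truth.
    Responses that cannot be parsed count as wrong."""
    if len(responses) != len(answers):
        raise ValueError("responses and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(extract_choice(r) == a for r, a in zip(responses, answers))
    return correct / len(answers)


if __name__ == "__main__":
    # Toy data, not taken from the benchmark.
    model_outputs = ["A", "b.", "I think it is C", "(D)"]
    ground_truth = ["A", "B", "C", "A"]
    print(f"accuracy = {accuracy(model_outputs, ground_truth):.2f}")
```

Treating unparseable responses as wrong keeps the metric deterministic; a more lenient parser would accept free-form answers but could misread a letter that appears inside ordinary prose.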
