AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

December 3, 2024
作者: Kaixiong Gong, Kaituo Feng, Bohao Li, Yibing Wang, Mofan Cheng, Shijia Yang, Jiaming Han, Benyou Wang, Yutong Bai, Zhuoran Yang, Xiangyu Yue
cs.AI

Abstract

Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial: 1) determining which of two sounds is louder, and 2) determining which of two sounds has a higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether these MLLMs can truly understand audio-visual information. The benchmark comprises 4,555 carefully crafted problems, each incorporating text, visual, and audio components. To successfully infer answers, models must effectively leverage clues from both visual and audio inputs. To ensure precise and objective evaluation of MLLM responses, we structure the questions as multiple-choice, eliminating the need for human evaluation or LLM-assisted assessment. We benchmark a series of closed-source and open-source models and summarize our observations. By revealing the limitations of current models, we aim to provide useful insights for future dataset collection and model development.
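
As an illustration of the DeafTest tasks described in the abstract, below is a minimal sketch of how loudness- and pitch-comparison stimulus pairs could be synthesized. The sample rate, tone frequencies, amplitudes, and file names are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of DeafTest-style stimulus pairs: two sine tones that
# differ only in loudness (task 1) or only in pitch (task 2).
# All parameter values and file names below are illustrative assumptions.
import numpy as np
from scipy.io import wavfile

SR = 16_000       # sample rate in Hz (assumed)
DURATION = 1.0    # seconds per tone (assumed)

def tone(freq_hz: float, amplitude: float) -> np.ndarray:
    """Render a sine tone as 16-bit PCM samples."""
    t = np.linspace(0.0, DURATION, int(SR * DURATION), endpoint=False)
    wave = amplitude * np.sin(2.0 * np.pi * freq_hz * t)
    return (wave * 32767).astype(np.int16)

# Task 1: same pitch, different loudness -- "Which of the two sounds is louder?"
wavfile.write("loudness_a.wav", SR, tone(440.0, 0.8))
wavfile.write("loudness_b.wav", SR, tone(440.0, 0.2))

# Task 2: same loudness, different pitch -- "Which of the two sounds is higher?"
wavfile.write("pitch_a.wav", SR, tone(330.0, 0.5))
wavfile.write("pitch_b.wav", SR, tone(660.0, 0.5))
```

Because every AV-Odyssey question is multiple-choice, model outputs on such stimuli can be scored by exact match against the correct option letter, with no human or LLM judge in the loop.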
