AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

Abstract

Recently, multimodal large language models (MLLMs) such as GPT-4o, Gemini 1.5 Pro, and Reka Core have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks that humans find trivial: 1) determining which of two sounds is louder, and 2) determining which of two sounds has a higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether these MLLMs can truly understand audio-visual information. The benchmark comprises 4,555 carefully crafted problems, each incorporating text, visual, and audio components. To infer the correct answer, a model must effectively leverage clues from both the visual and the audio inputs. To ensure precise and objective evaluation of MLLM responses, we structure the questions as multiple-choice, eliminating the need for human evaluation or LLM-assisted assessment. We benchmark a series of closed-source and open-source models and summarize our observations. By revealing the limitations of current models, we aim to provide useful insight for future dataset collection and model development.
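Because every question is multiple-choice, scoring can in principle be a plain string comparison against the answer key. The following is a minimal sketch of that idea, not the benchmark's actual evaluation code: the function names, the four-option `A`-`D` format, and the assumption that a model reply begins with its chosen option letter are all hypothetical.

```python
import re
from typing import List, Optional

# Assumption: four options labelled A-D, and the reply starts with the chosen
# letter, optionally parenthesised, e.g. "B", "(B)", "B." or "B: the violin".
_CHOICE_PATTERN = re.compile(r"^\(?([A-D])\)?(?:[.:\s]|$)")


def extract_choice(response: str) -> Optional[str]:
    """Return the leading option letter of a reply, or None if there is none."""
    match = _CHOICE_PATTERN.match(response.strip())
    return match.group(1) if match else None


def accuracy(responses: List[str], answers: List[str]) -> float:
    """Fraction of replies whose extracted letter equals the answer key."""
    if len(responses) != len(answers):
        raise ValueError("responses and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(extract_choice(r) == a for r, a in zip(responses, answers))
    return correct / len(answers)
```

A reply that does not start with an option letter is counted as incorrect here. A real harness might instead re-prompt the model or use a more tolerant parser.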
