Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

Abstract

Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60% to 70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at https://github.com/thuiar/MMLA.
