Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
April 23, 2025
作者: Hanlei Zhang, Zhuohang Li, Yeshuang Zhu, Hua Xu, Peiwu Wang, Haige Zhu, Jie Zhou, Jinchao Zhang
cs.AI
Abstract
Multimodal language analysis is a rapidly evolving field that leverages
multiple modalities to enhance the understanding of high-level semantics
underlying human conversational utterances. Despite its significance, little
research has investigated the capability of multimodal large language models
(MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce
MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA
comprises over 61K multimodal utterances drawn from both staged and real-world
scenarios, covering six core dimensions of multimodal semantics: intent,
emotion, dialogue act, sentiment, speaking style, and communication behavior.
We evaluate eight mainstream branches of LLMs and MLLMs using three methods:
zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive
experiments reveal that even fine-tuned models achieve only about 60% to 70%
accuracy, underscoring the limitations of current MLLMs in understanding
complex human language. We believe that MMLA will serve as a solid foundation
for exploring the potential of large language models in multimodal language
analysis and provide valuable resources to advance this field. The datasets and
code are open-sourced at https://github.com/thuiar/MMLA.