ChatPaper.ai


Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

April 23, 2025
作者: Hanlei Zhang, Zhuohang Li, Yeshuang Zhu, Hua Xu, Peiwu Wang, Haige Zhu, Jie Zhou, Jinchao Zhang
cs.AI

Abstract

Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60%~70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at https://github.com/thuiar/MMLA.
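The abstract describes evaluating LLMs and MLLMs via zero-shot inference across six semantic dimensions and reporting accuracy. The following is a minimal hypothetical sketch of what such a zero-shot evaluation loop could look like; the sample format, label sets, and `query_model` callable are illustrative assumptions, not the MMLA implementation.

```python
# Hypothetical sketch of a zero-shot evaluation loop over MMLA-style
# dimensions: prompt a model to pick one label per utterance, then
# score accuracy. `query_model` and the sample schema are assumptions.

DIMENSIONS = ["intent", "emotion", "dialogue act",
              "sentiment", "speaking style", "communication behavior"]

def zero_shot_accuracy(samples, query_model):
    """samples: dicts with 'utterance', 'dimension', 'labels', 'gold'.
    query_model(prompt) -> the model's raw text answer."""
    correct = 0
    for s in samples:
        prompt = (f"Given the utterance: \"{s['utterance']}\"\n"
                  f"Which {s['dimension']} label best applies? "
                  f"Choose one of: {', '.join(s['labels'])}.")
        answer = query_model(prompt)
        # Lenient matching: count a hit if the gold label appears
        # anywhere in the model's free-form answer.
        if s["gold"].lower() in answer.lower():
            correct += 1
    return correct / len(samples)
```

In practice, benchmark harnesses usually constrain decoding or parse the answer more strictly than this substring match, but the overall loop (prompt, query, compare against gold) is the same.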

