Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
April 23, 2025
作者: Hanlei Zhang, Zhuohang Li, Yeshuang Zhu, Hua Xu, Peiwu Wang, Haige Zhu, Jie Zhou, Jinchao Zhang
cs.AI
Abstract
Multimodal language analysis is a rapidly evolving field that leverages
multiple modalities to enhance the understanding of high-level semantics
underlying human conversational utterances. Despite its significance, little
research has investigated the capability of multimodal large language models
(MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce
MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA
comprises over 61K multimodal utterances drawn from both staged and real-world
scenarios, covering six core dimensions of multimodal semantics: intent,
emotion, dialogue act, sentiment, speaking style, and communication behavior.
We evaluate eight mainstream branches of LLMs and MLLMs using three methods:
zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive
experiments reveal that even fine-tuned models achieve only about 60% to 70%
accuracy, underscoring the limitations of current MLLMs in understanding
complex human language. We believe that MMLA will serve as a solid foundation
for exploring the potential of large language models in multimodal language
analysis and provide valuable resources to advance this field. The datasets and
code are open-sourced at https://github.com/thuiar/MMLA.