Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

Abstract

Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to improve the understanding of the high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60% to 70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at https://github.com/thuiar/MMLA.
