Can MLLMs Understand the Deep Implication Behind Chinese Images?

Abstract

As the capabilities of Multimodal Large Language Models (MLLMs) continue to improve, the need to evaluate their higher-order capabilities is growing. However, little work has evaluated MLLMs on higher-order perception and understanding of Chinese visual content. To fill this gap, we introduce the **C**hinese **I**mage **I**mplication understanding **Bench**mark, **CII-Bench**, which aims to assess the higher-order perception and understanding capabilities of MLLMs for Chinese images. CII-Bench differs from existing benchmarks in two main ways. First, to ensure the authenticity of the Chinese context, images in CII-Bench are sourced from the Chinese Internet and manually reviewed, and the corresponding answers are also manually crafted. Second, CII-Bench incorporates images that represent Chinese traditional culture, such as famous Chinese traditional paintings, which can deeply reflect a model's understanding of that culture. Extensive experiments on CII-Bench across multiple MLLMs yield three main findings. First, there is a substantial gap between the performance of MLLMs and humans on CII-Bench: the highest MLLM accuracy is 64.4%, whereas human accuracy averages 78.2% and peaks at 81.0%. Second, MLLMs perform worse on Chinese traditional culture images, suggesting that they are limited in their ability to understand high-level semantics and lack a deep knowledge base of Chinese traditional culture. Third, most models achieve higher accuracy when image emotion hints are incorporated into the prompts. We believe that CII-Bench will enable MLLMs to gain a better understanding of Chinese semantics and Chinese-specific images, advancing the journey towards expert artificial general intelligence (AGI). Our project is publicly available at https://cii-bench.github.io/.