Can MLLMs (Multimodal Large Language Models) Understand the Deep Implications Behind Chinese Images?

Abstract

As the capabilities of Multimodal Large Language Models (MLLMs) continue to improve, the need to evaluate their higher-order capabilities is growing. However, little work has evaluated MLLMs on higher-order perception and understanding of Chinese visual content. To fill this gap, we introduce the **C**hinese **I**mage **I**mplication understanding **Bench**mark, **CII-Bench**, which aims to assess the higher-order perception and understanding capabilities of MLLMs on Chinese images. CII-Bench stands out from existing benchmarks in several ways. First, to ensure the authenticity of the Chinese context, images in CII-Bench are sourced from the Chinese Internet and manually reviewed, and the corresponding answers are also manually written. Second, CII-Bench incorporates images that represent Chinese traditional culture, such as famous Chinese traditional paintings, which allows it to probe a model's understanding of Chinese traditional culture in depth. Extensive experiments on CII-Bench across multiple MLLMs yield three main findings. First, there is a substantial gap between the performance of MLLMs and humans on CII-Bench: the highest MLLM accuracy is 64.4%, whereas human accuracy averages 78.2% and peaks at 81.0%. Second, MLLMs perform worse on Chinese traditional culture images, which suggests that their ability to understand high-level semantics is limited and that they lack a deep knowledge base of Chinese traditional culture. Finally, most models achieve higher accuracy when image emotion hints are incorporated into the prompts. We believe that CII-Bench will enable MLLMs to gain a better understanding of Chinese semantics and Chinese-specific images, advancing the journey towards expert artificial general intelligence (AGI). Our project is publicly available at https://cii-bench.github.io/.