ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness

Abstract

Color plays an important role in human perception and usually provides critical clues in visual reasoning. However, it is unclear whether and how vision-language models (VLMs) can perceive, understand, and leverage color as humans do. This paper introduces ColorBench, a benchmark designed to assess the color understanding of VLMs along three axes: color perception, reasoning, and robustness. By curating a suite of diverse test scenarios grounded in real applications, ColorBench evaluates how these models perceive colors, infer meaning from color-based cues, and maintain consistent performance under varying color transformations. Through an extensive evaluation of 32 VLMs with varying language models and vision encoders, the paper reports several previously unreported findings: (i) The scaling law (larger models are better) still holds on ColorBench, while the language model plays a more important role than the vision encoder. (ii) However, the performance gaps across models are relatively small, indicating that color understanding has been largely neglected by existing VLMs. (iii) CoT reasoning improves color understanding accuracy and robustness, even though these are vision-centric tasks. (iv) Color clues are indeed leveraged by VLMs on ColorBench, but they can also mislead models in some tasks. These findings highlight critical limitations of current VLMs and underscore the need to enhance color comprehension. ColorBench can serve as a foundational tool for advancing the study of human-level color understanding in multimodal AI.
