Basic Category Usage in Vision Language Models

Abstract

The field of psychology has long recognized a basic level of categorization that humans use when labeling visual stimuli, a term coined by Rosch in 1976. This level of categorization has been found to be used most frequently, to have higher information density, and to aid humans in visual language tasks with priming. Here, we investigate basic-level categorization in two recently released, open-source vision-language models (VLMs). We show that Llama 3.2 Vision Instruct (11B) and Molmo 7B-D both prefer basic-level categorization consistent with human behavior. Moreover, the models' preferences are consistent with nuanced human behaviors such as the biological versus non-biological basic-level effects and the well-established expert basic-level shift, further suggesting that VLMs acquire cognitive categorization behaviors from the human data on which they are trained.