Basic Category Usage in Vision Language Models
March 16, 2025
Authors: Hunter Sawyer, Jesse Roberts, Kyle Moore
cs.AI
Abstract
The field of psychology has long recognized a basic level of categorization
that humans use when labeling visual stimuli, a term coined by Rosch in 1976.
This level of categorization has been found to be used most frequently, to have
higher information density, and to aid in visual language tasks with priming in
humans. Here, we investigate basic level categorization in two recently
released, open-source vision-language models (VLMs). This paper demonstrates
that Llama 3.2 Vision Instruct (11B) and Molmo 7B-D both prefer basic level
categorization consistent with human behavior. Moreover, the models'
preferences are consistent with nuanced human behaviors like the biological
versus non-biological basic level effects and the well-established expert basic
level shift, further suggesting that VLMs acquire cognitive categorization
behaviors from the human data on which they are trained.