Basic Category Usage in Vision Language Models

March 16, 2025
Authors: Hunter Sawyer, Jesse Roberts, Kyle Moore
cs.AI

Abstract

The field of psychology has long recognized a basic level of categorization that humans use when labeling visual stimuli, a term coined by Rosch in 1976. This level of categorization has been found to be used most frequently, to have higher information density, and to aid humans in visual language tasks through priming. Here, we investigate basic level categorization in two recently released, open-source vision-language models (VLMs). This paper demonstrates that Llama 3.2 Vision Instruct (11B) and Molmo 7B-D both prefer basic level categorization consistent with human behavior. Moreover, the models' preferences are consistent with nuanced human behaviors like the biological versus non-biological basic level effects and the well-established expert basic level shift, further suggesting that VLMs acquire cognitive categorization behaviors from the human data on which they are trained.
