Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models
May 3, 2025
Authors: Gracjan Góral, Alicja Ziarko, Piotr Miłoś, Michał Nauman, Maciej Wołczyk, Michał Kosiński
cs.AI
Abstract
We investigate the ability of Vision Language Models (VLMs) to perform visual
perspective taking using a novel set of visual tasks inspired by established
human tests. Our approach leverages carefully controlled scenes, in which a
single humanoid minifigure is paired with a single object. By systematically
varying spatial configurations - such as object position relative to the
humanoid minifigure and the humanoid minifigure's orientation - and using both
bird's-eye and surface-level views, we created 144 unique visual tasks. Each
visual task is paired with a series of 7 diagnostic questions designed to
assess three levels of visual cognition: scene understanding, spatial
reasoning, and visual perspective taking. Our evaluation of several
state-of-the-art models, including GPT-4-Turbo, GPT-4o,
Llama-3.2-11B-Vision-Instruct, and variants of Claude Sonnet, reveals that
while they excel in scene understanding, their performance declines significantly
on spatial reasoning and deteriorates further on perspective taking. Our
analysis suggests a gap between surface-level object recognition and the deeper
spatial and perspective reasoning required for complex visual tasks, pointing
to the need for integrating explicit geometric representations and tailored
training protocols in future VLM development.
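The benchmark described above is a Cartesian product of spatial factors and camera views, with each resulting task scored on seven questions grouped into three cognition levels. Below is a minimal sketch of that structure, assuming hypothetical factor levels, question IDs, and a 2/2/3 grouping of the seven questions; none of these specifics are given in the abstract, only the total of 144 tasks and the three levels.

from itertools import product
from collections import defaultdict

# Illustrative factor levels only: the abstract says object position, minifigure
# orientation, and the two views combine into 144 tasks, but does not list the
# exact levels, so these values are assumptions chosen to reproduce the count.
object_positions = ["in front", "behind", "left of", "right of", "next to", "far from"]
figure_orientations = [f"rotated_{deg}" for deg in range(0, 360, 30)]  # 12 orientations
views = ["birds_eye", "surface_level"]

# Seven diagnostic questions grouped into the three levels named in the abstract;
# the 2/2/3 split is a guess, not taken from the paper.
QUESTION_LEVELS = {
    "scene_understanding": ["q1", "q2"],
    "spatial_reasoning": ["q3", "q4"],
    "perspective_taking": ["q5", "q6", "q7"],
}

# Enumerate every (position, orientation, view) combination as one visual task.
tasks = [
    {"position": pos, "orientation": ori, "view": view}
    for pos, ori, view in product(object_positions, figure_orientations, views)
]
print(len(tasks))  # 6 * 12 * 2 = 144 with these illustrative levels

def accuracy_by_level(results):
    """results maps (task_index, question_id) -> bool (model answered correctly).
    Returns per-level accuracy, mirroring the three-level breakdown in the abstract."""
    correct, total = defaultdict(int), defaultdict(int)
    for (task_idx, qid), is_correct in results.items():
        for level, questions in QUESTION_LEVELS.items():
            if qid in questions:
                total[level] += 1
                correct[level] += int(is_correct)
    return {level: correct[level] / total[level] for level in total}

Grouping accuracy by question level rather than reporting a single score is what exposes the pattern the abstract reports: near-ceiling scene understanding alongside weaker spatial reasoning and still weaker perspective taking.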