Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models

May 3, 2025
作者: Gracjan Góral, Alicja Ziarko, Piotr Miłoś, Michał Nauman, Maciej Wołczyk, Michał Kosiński
cs.AI

Abstract

We investigate the ability of Vision Language Models (VLMs) to perform visual perspective taking using a novel set of visual tasks inspired by established human tests. Our approach leverages carefully controlled scenes, in which a single humanoid minifigure is paired with a single object. By systematically varying spatial configurations - such as object position relative to the humanoid minifigure and the humanoid minifigure's orientation - and using both bird's-eye and surface-level views, we created 144 unique visual tasks. Each visual task is paired with a series of 7 diagnostic questions designed to assess three levels of visual cognition: scene understanding, spatial reasoning, and visual perspective taking. Our evaluation of several state-of-the-art models, including GPT-4-Turbo, GPT-4o, Llama-3.2-11B-Vision-Instruct, and variants of Claude Sonnet, reveals that while they excel in scene understanding, the performance declines significantly on spatial reasoning and further deteriorates on perspective-taking. Our analysis suggests a gap between surface-level object recognition and the deeper spatial and perspective reasoning required for complex visual tasks, pointing to the need for integrating explicit geometric representations and tailored training protocols in future VLM development.
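As a rough illustration of how such a controlled task grid can be enumerated, the sketch below crosses scene factors (object position relative to the minifigure, minifigure orientation, and viewpoint) into task descriptors. The specific factor values are assumptions for illustration only and do not reproduce the paper's exact 144-task design or its 7 diagnostic questions.

```python
# Illustrative sketch (not the authors' code): building a controlled grid of
# visual perspective-taking tasks by crossing scene factors, in the spirit of
# the benchmark described in the abstract.
from itertools import product

# Hypothetical factor levels -- assumed, not taken from the paper.
OBJECT_POSITIONS = ["front", "behind", "left", "right", "front-left", "front-right"]
FIGURE_ORIENTATIONS_DEG = [0, 90, 180, 270]
VIEWPOINTS = ["birds_eye", "surface_level"]  # the two views stated in the abstract

# Three levels of visual cognition probed per task (from the abstract).
QUESTION_LEVELS = ["scene_understanding", "spatial_reasoning", "perspective_taking"]

def enumerate_tasks():
    """Yield one task descriptor per unique scene configuration."""
    for pos, orient, view in product(OBJECT_POSITIONS, FIGURE_ORIENTATIONS_DEG, VIEWPOINTS):
        yield {
            "object_position": pos,
            "figure_orientation_deg": orient,
            "viewpoint": view,
            "question_levels": QUESTION_LEVELS,
        }

tasks = list(enumerate_tasks())
# With these assumed factors: 6 * 4 * 2 = 48 configurations; the paper's
# actual factorial design yields 144 unique visual tasks.
print(len(tasks))
```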
