Can Vision Language Models Infer Human Gaze Direction? A Controlled Study

June 4, 2025
作者: Zory Zhang, Pinyuan Feng, Bingyang Wang, Tianwei Zhao, Suyang Yu, Qingying Gao, Hokin Deng, Ziqiao Ma, Yijiang Li, Dezhi Luo
cs.AI

Abstract

Gaze-referential inference, the ability to infer what others are looking at, is a critical component of a theory of mind that underpins natural human-AI interaction. In a controlled study, we evaluated this skill across 111 Vision Language Models (VLMs) using photographs in which difficulty and variability were systematically manipulated, compared their performance with that of human participants (N = 65), and analyzed behavior with mixed-effects models. We found that 94 of the 111 VLMs failed to perform better than random guessing, while humans achieved near-ceiling accuracy. VLMs even selected each answer option at almost equal frequency. Were they guessing at random? Although most VLMs struggled, when we zoomed in on five top-tier VLMs with above-chance performance, we found that their accuracy declined as task difficulty increased but varied only slightly across prompts and scene objects. These behavioral features cannot be explained by treating them as random guessers. Instead, they likely combine heuristics with guessing, so that their performance is sensitive to task difficulty but robust to perceptual variation. This suggests that VLMs, still lacking gaze-inference capability, have yet to become technologies that can interact naturally with humans, but the potential remains.
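To make the two analyses mentioned in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' code) of (1) testing whether a single VLM performs above random guessing and (2) fitting a mixed-effects model of per-trial correctness against task difficulty with a random intercept per model. The 25% chance level, the synthetic data, and all variable names are illustrative assumptions, not details taken from the paper.

```python
# Minimal, hypothetical sketch of the two analyses the abstract refers to.
# Not the authors' code: the 4-option chance level (25%), the synthetic
# data, and all variable names are illustrative assumptions.
import numpy as np
import pandas as pd
from scipy.stats import binomtest
import statsmodels.formula.api as smf

# (1) Is one VLM above chance? Suppose it answered 48 of 120 trials
# correctly on an assumed 4-alternative forced-choice task.
res = binomtest(k=48, n=120, p=0.25, alternative="greater")
print(f"one-sided p-value vs. 25% chance: {res.pvalue:.4f}")

# (2) Mixed-effects analysis of per-trial correctness. Synthetic long-format
# data: 5 models x 90 trials, with accuracy dropping as difficulty rises,
# mimicking the pattern reported for the top-tier VLMs.
rng = np.random.default_rng(0)
rows = []
for model_id in ["vlm_a", "vlm_b", "vlm_c", "vlm_d", "vlm_e"]:
    for difficulty in (1, 2, 3):
        p_correct = 0.85 - 0.15 * (difficulty - 1)
        for _ in range(30):
            rows.append({"model_id": model_id,
                         "difficulty": difficulty,
                         "correct": int(rng.random() < p_correct)})
data = pd.DataFrame(rows)

# Linear mixed model with a random intercept per VLM (a linear-probability
# simplification; a logistic mixed model would be the more standard choice
# for a binary outcome).
fit = smf.mixedlm("correct ~ difficulty", data, groups=data["model_id"]).fit()
print(fit.summary())
```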