Can Vision Language Models Infer Human Gaze Direction? A Controlled Study
June 4, 2025
Authors: Zory Zhang, Pinyuan Feng, Bingyang Wang, Tianwei Zhao, Suyang Yu, Qingying Gao, Hokin Deng, Ziqiao Ma, Yijiang Li, Dezhi Luo
cs.AI
Abstract
Gaze-referential inference--the ability to infer what others are looking
at--is a critical component of a theory of mind that underpins natural human-AI
interaction. In a controlled study, we evaluated this skill across 111 Vision
Language Models (VLMs) using photos taken with manipulated difficulty and
variability, compared their performance with that of human participants (N = 65),
and analyzed behaviors using mixed-effects models. We found that 94 of the 111
VLMs failed to do better than random guessing, while humans achieved
near-ceiling accuracy. The VLMs even responded with each choice almost equally
frequently. Were they randomly guessing? Although most VLMs struggled, when we
zoomed in on five of the top-tier VLMs with above-chance performance, we found
that their performance declined with increasing task difficulty but varied only
slightly across different prompts and scene objects. These behavioral features
cannot be explained by treating the models as random guessers. Instead, they
likely use a combination of heuristics and guessing, such that their performance
is subject to task difficulty but robust to perceptual variations. This
suggests that VLMs, lacking gaze inference capability, have yet to become
technologies that can interact naturally with humans, but the potential
remains.