Can Vision Language Models Infer Human Gaze Direction? A Controlled Study

Abstract

Gaze-referential inference, the ability to infer what others are looking at, is a critical component of a theory of mind that underpins natural human-AI interaction. In a controlled study, we evaluated this skill across 111 Vision Language Models (VLMs) using photos taken with manipulated difficulty and variability, compared their performance with that of human participants (N = 65), and analyzed behaviors using mixed-effects models. We found that 94 of the 111 VLMs failed to do better than random guessing, while humans achieved near-ceiling accuracy. VLMs even chose each response option almost equally often. Are they randomly guessing? Although most VLMs struggled, when we zoomed in on five top-tier VLMs with above-chance performance, we found that their performance declined with increasing task difficulty but varied only slightly across different prompts and scene objects. These behavioral features cannot be explained by treating the models as random guessers. Instead, they likely use a combination of heuristics and guessing, such that their performance is sensitive to task difficulty but robust to perceptual variations. This suggests that VLMs, lacking gaze inference capability, have yet to become technologies that can naturally interact with humans, but the potential remains.
