
Can Vision-Language Models Solve the Shell Game?

March 9, 2026
Authors: Tiedong Liu, Wee Sun Lee
cs.AI

Abstract

Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that necessitate tracking exclusively through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision due to expressivity constraints. To address this, we propose Spatiotemporal Grounded Chain-of-Thought (SGCoT): generating object trajectories as explicit intermediate states. Leveraging Molmo2's object tracking ability, we elicit SGCoT reasoning by fine-tuning on synthesized text-only data for alignment. Our method achieves state-of-the-art accuracy exceeding 90% on VET-Bench, demonstrating that VLMs can reliably solve the video shell-game task end-to-end without external tools. Our code and data are available at https://vetbench.github.io.
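The connection the abstract draws between the shell game and state tracking can be made concrete with a toy sketch (illustrative only, not code from the paper): locating the ball requires sequentially composing swaps, and SGCoT-style reasoning amounts to emitting each intermediate position explicitly rather than collapsing the whole swap history into one step.

```python
# Toy illustration of the shell-game-as-state-tracking reduction.
# Cups are visually identical, so only the swap history (spatiotemporal
# continuity) determines where the ball is. Emitting the full trajectory
# mirrors SGCoT's explicit intermediate states.

def track_ball(start: int, swaps: list[tuple[int, int]]) -> list[int]:
    """Return the ball's position after each swap (the explicit trajectory)."""
    pos = start
    trajectory = [pos]
    for a, b in swaps:
        if pos == a:
            pos = b
        elif pos == b:
            pos = a
        trajectory.append(pos)
    return trajectory

# Ball starts under cup 0, then cups (0,1), (1,2), (0,2) are swapped.
traj = track_ball(0, [(0, 1), (1, 2), (0, 2)])
print(traj)      # [0, 1, 2, 0]
print(traj[-1])  # final answer: cup 0
```

A model answering only the final question must implicitly compute this composition internally; supervising the intermediate `trajectory` turns a deep sequential problem into a series of shallow single-swap updates.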