Can Vision-Language Models Solve the Shell Game?
March 9, 2026
Authors: Tiedong Liu, Wee Sun Lee
cs.AI
Abstract
Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that can be tracked only through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that, due to expressivity constraints, fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision. To address this, we propose Spatiotemporal Grounded Chain-of-Thought (SGCoT): generating object trajectories as explicit intermediate states. Leveraging Molmo2's object tracking ability, we elicit SGCoT reasoning by fine-tuning on synthesized text-only data for alignment. Our method achieves state-of-the-art accuracy exceeding 90% on VET-Bench, demonstrating that VLMs can reliably solve the video shell-game task end-to-end without external tools. Our code and data are available at https://vetbench.github.io.
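To make the shell-game setting and the "explicit intermediate states" idea behind SGCoT concrete, here is a minimal, hypothetical sketch (not the authors' code or data pipeline): a toy text-only shell game in which a target starts under one of several indistinguishable cups and a sequence of swaps is applied. Answering correctly requires carrying state through every swap, which is exactly the per-step trajectory SGCoT makes explicit; the function and variable names below are illustrative assumptions.

```python
# Hypothetical sketch: explicit intermediate-state tracking for a toy shell game.
# Not the VET-Bench generator or the SGCoT implementation; names are illustrative.
import random


def generate_shell_game(num_cups=3, num_swaps=5, seed=0):
    """Generate the target's starting cup and a random sequence of pairwise swaps."""
    rng = random.Random(seed)
    start = rng.randrange(num_cups)
    swaps = [tuple(rng.sample(range(num_cups), 2)) for _ in range(num_swaps)]
    return start, swaps


def track_with_intermediate_states(start, swaps):
    """Return the target's position after every swap (an SGCoT-style trajectory)."""
    pos = start
    trajectory = [pos]
    for a, b in swaps:
        # The target moves only if its current cup participates in the swap.
        if pos == a:
            pos = b
        elif pos == b:
            pos = a
        trajectory.append(pos)  # record the explicit intermediate state
    return trajectory


if __name__ == "__main__":
    start, swaps = generate_shell_game()
    traj = track_with_intermediate_states(start, swaps)
    print(f"start={start}, swaps={swaps}")
    print(f"trajectory={traj}, final position={traj[-1]}")
```

In this toy form the final answer is a deterministic function of the full swap history, so supervising (or generating) the whole trajectory, rather than only the final position, mirrors the paper's claim that intermediate states are what fixed-depth models otherwise fail to maintain.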