Can Vision-Language Models Solve the Shell Game?

Abstract

Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that necessitate tracking exclusively through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision due to expressivity constraints. To address this, we propose Spatiotemporal Grounded Chain-of-Thought (SGCoT): generating object trajectories as explicit intermediate states. Leveraging Molmo2's object tracking ability, we elicit SGCoT reasoning by fine-tuning on synthesized text-only data for alignment. Our method achieves state-of-the-art accuracy exceeding 90% on VET-Bench, demonstrating that VLMs can reliably solve the video shell-game task end-to-end without external tools. Our code and data are available at https://vetbench.github.io.
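The state-tracking framing can be made concrete with a small symbolic sketch. The code below is an illustration only, not the authors' implementation or the VET-Bench pipeline: it assumes the shell game has already been reduced to a list of cup swaps, and the function name `track_shell_game` and the swap encoding are hypothetical. It records the ball's position after every swap, which mirrors in spirit the idea of emitting explicit intermediate states rather than jumping straight to the final answer.

```python
from typing import List, Tuple


def track_shell_game(start: int, swaps: List[Tuple[int, int]]) -> List[int]:
    """Return the ball's position before any swap and after each swap.

    `start` is the index of the cup initially hiding the ball.
    Each swap (a, b) exchanges the cups at positions a and b.
    (Hypothetical encoding, chosen for illustration.)
    """
    position = start
    trajectory = [position]  # explicit intermediate states
    for a, b in swaps:
        if position == a:
            position = b
        elif position == b:
            position = a
        # If the ball is under neither swapped cup, it stays put.
        trajectory.append(position)
    return trajectory


if __name__ == "__main__":
    swaps = [(0, 1), (1, 2), (0, 2)]
    print(track_shell_game(0, swaps))  # [0, 1, 2, 0]
```

Because the cups are visually identical, the final frame alone carries no information about where the ball is; the answer depends on composing every swap in order, which is why a record of intermediate positions is useful.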
