Can Vision-Language Models Solve the Shell Game?

Abstract

Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). In existing video benchmarks, this deficit is often obscured by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that can be tracked only through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision due to expressivity constraints. To address this, we propose Spatiotemporal Grounded Chain-of-Thought (SGCoT): generating object trajectories as explicit intermediate states. Leveraging Molmo2's object tracking ability, we elicit SGCoT reasoning by fine-tuning on synthesized text-only data for alignment. Our method achieves state-of-the-art accuracy exceeding 90% on VET-Bench, demonstrating that VLMs can reliably solve the video shell-game task end-to-end without external tools. Our code and data are available at https://vetbench.github.io.
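The state-tracking view of the shell game mentioned in the abstract can be illustrated with a small sketch. This is not the paper's SGCoT method or the VET-Bench code. It is a toy model, under the assumption that a shell game reduces to a start position plus a sequence of pairwise position swaps; the function names `track_ball` and `trajectory` are hypothetical. It shows why the final answer depends on every intermediate step, and what "object trajectories as explicit intermediate states" might look like in the simplest case.

```python
# Toy illustration (assumed model, not the paper's implementation):
# cups sit at integer positions, and each move swaps two positions.

def track_ball(start: int, swaps: list[tuple[int, int]]) -> int:
    """Follow the ball through a sequence of position swaps."""
    pos = start
    for a, b in swaps:
        if pos == a:
            pos = b          # the ball's cup moved from a to b
        elif pos == b:
            pos = a          # the ball's cup moved from b to a
    return pos


def trajectory(start: int, swaps: list[tuple[int, int]]) -> list[int]:
    """Return the ball position after every swap (explicit intermediate states)."""
    states = [start]
    for swap in swaps:
        states.append(track_ball(states[-1], [swap]))
    return states


if __name__ == "__main__":
    swaps = [(0, 1), (1, 2), (0, 2)]
    print(trajectory(0, swaps))  # prints [0, 1, 2, 0]
```

Because the cups are visually identical, the last frame alone carries no information about the ball, so a guess from that frame is correct with probability 1/N for N cups. Only the full sequence of swaps determines the answer.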
