Time Blindness: Why Video-Language Models Can't See What Humans Can?

Abstract

Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce SpookyBench, a benchmark in which information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained on datasets with low spatial signal-to-noise ratios (SNR), models' temporal understanding degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and to bridge the gap between human and machine video understanding. The dataset and code are available on our project website: https://timeblindness.github.io/.
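The abstract does not say how the noise-like clips are constructed, so the following is only a toy illustration of the general idea that a shape can be invisible in every single frame yet recoverable from motion over time. It assumes one possible encoding (binary noise that drifts right inside the shape and left outside it) and is not the SpookyBench generator. The names `make_temporal_clip` and `decode_by_motion` are hypothetical.

```python
import numpy as np


def make_temporal_clip(mask, num_frames=32, seed=0):
    """Hide a boolean `mask` (H x W) in binary noise using opposing motion.

    Assumption for illustration only: foreground noise drifts right and
    background noise drifts left, one pixel per frame, with wrap-around.
    """
    rng = np.random.default_rng(seed)
    fg_tex = rng.integers(0, 2, size=mask.shape, dtype=np.uint8)
    bg_tex = rng.integers(0, 2, size=mask.shape, dtype=np.uint8)
    frames = []
    for t in range(num_frames):
        fg = np.roll(fg_tex, t, axis=1)    # foreground texture shifted right by t
        bg = np.roll(bg_tex, -t, axis=1)   # background texture shifted left by t
        frames.append(np.where(mask, fg, bg))
    return np.stack(frames)                # shape: (num_frames, H, W)


def decode_by_motion(frames):
    """Recover the mask by testing which motion direction each pixel follows."""
    cur, nxt = frames[:-1], frames[1:]
    # Rightward motion: the value at x in frame t reappears at x + 1 in frame t + 1.
    right_score = (np.roll(nxt, -1, axis=2) == cur).mean(axis=0)
    # Leftward motion: the value at x in frame t reappears at x - 1 in frame t + 1.
    left_score = (np.roll(nxt, 1, axis=2) == cur).mean(axis=0)
    return right_score > left_score


# Hide a square, then recover it from temporal consistency alone.
mask = np.zeros((64, 64), dtype=bool)
mask[20:44, 20:44] = True
frames = make_temporal_clip(mask)
recovered = decode_by_motion(frames)
accuracy = (recovered == mask).mean()
```

In this sketch each individual frame is independent binary noise, so a purely frame-level feature extractor has nothing to work with, while a decoder that compares consecutive frames recovers the square except for a thin ambiguous band along its vertical edges.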
