Time Blindness: Why Video-Language Models Can't See What Humans Can?

Abstract

Recent vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce SpookyBench, a benchmark in which information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained on datasets with low spatial signal-to-noise ratios (SNR), models' temporal understanding degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and to bridge the gap between human and machine video understanding. The dataset and code are available on our project website: https://timeblindness.github.io/.
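To make the idea of information "encoded solely in temporal sequences of noise-like frames" concrete, the following toy sketch may help. It is not SpookyBench's generation code: the opposing-drift scheme, the function names `make_temporal_only_clip` and `decode_by_motion`, and all parameters are assumptions chosen purely for illustration. Each frame is binary random noise, so no single frame reveals the shape. The shape can be recovered only by comparing consecutive frames.

```python
import random


def make_temporal_only_clip(mask, num_frames, seed=0):
    """Build binary noise frames in which `mask` is visible only through motion.

    Illustrative assumption: dots inside the mask drift left by one pixel per
    frame, dots outside drift right. Every individual frame is random noise.
    """
    rng = random.Random(seed)
    height, width = len(mask), len(mask[0])
    # Two independent random dot fields, one per region.
    fg = [[rng.randint(0, 1) for _ in range(width)] for _ in range(height)]
    bg = [[rng.randint(0, 1) for _ in range(width)] for _ in range(height)]
    frames = []
    for t in range(num_frames):
        frame = []
        for y in range(height):
            row = []
            for x in range(width):
                if mask[y][x]:
                    row.append(fg[y][(x + t) % width])  # foreground drifts left
                else:
                    row.append(bg[y][(x - t) % width])  # background drifts right
            frame.append(row)
        frames.append(frame)
    return frames


def decode_by_motion(frames):
    """Recover the mask by testing which drift direction explains each pixel."""
    height, width = len(frames[0]), len(frames[0][0])
    decoded = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            left = right = 0
            for prev, nxt in zip(frames, frames[1:]):
                # Leftward drift: this pixel's value came from its right neighbor.
                left += nxt[y][x] == prev[y][(x + 1) % width]
                # Rightward drift: this pixel's value came from its left neighbor.
                right += nxt[y][x] == prev[y][(x - 1) % width]
            decoded[y][x] = 1 if left > right else 0
    return decoded


if __name__ == "__main__":
    size = 32
    # A hypothetical 16x16 square as the hidden shape.
    square = [[1 if 8 <= y < 24 and 8 <= x < 24 else 0 for x in range(size)]
              for y in range(size)]
    clip = make_temporal_only_clip(square, num_frames=32, seed=0)
    recovered = decode_by_motion(clip)
    agree = sum(recovered[y][x] == square[y][x]
                for y in range(size) for x in range(size))
    print(f"pixel agreement with hidden mask: {agree / size ** 2:.3f}")
```

In this sketch the decoder looks only at frame-to-frame correspondences. A model that analyzes each frame independently has nothing to work with, which mirrors the failure mode the abstract describes. Pixels along the right-hand edge of the shape remain ambiguous because their neighbors belong to the other region.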
