
Time Blindness: Why Video-Language Models Can't See What Humans Can?

May 30, 2025
Authors: Ujjwal Upadhyay, Mukul Ranjan, Zhiqiang Shen, Mohamed Elhoseiny
cs.AI

Abstract

Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce SpookyBench, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained on datasets with low spatial signal-to-noise ratios (SNR), models' temporal understanding degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and to bridge the gap between human and machine video understanding. The dataset and code have been made available on our project website: https://timeblindness.github.io/.