Time Blindness: Why Video-Language Models Can't See What Humans Can?

Abstract

Recent vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce SpookyBench, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained on datasets with low spatial signal-to-noise ratios (SNR), models' temporal understanding degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and to bridge the gap between human and machine video understanding. The dataset and code are available on our project website: https://timeblindness.github.io/.
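To make the idea of information that exists only in time concrete, here is a minimal sketch in Python with NumPy. It is an illustrative assumption, not the SpookyBench generation procedure, which this abstract does not describe. The function names `make_temporal_noise_clip` and `recover_mask` are hypothetical. Every individual frame is independent uniform noise, so no single frame reveals the shape; the shape appears only because foreground pixels keep the same value across frames while background pixels are resampled.

```python
import numpy as np


def make_temporal_noise_clip(mask, num_frames=16, seed=0):
    """Build a clip of noise frames that hides `mask` in time only.

    Assumed toy encoding (not the benchmark's): background pixels get fresh
    noise in every frame, foreground pixels repeat one frozen noise value.
    """
    rng = np.random.default_rng(seed)
    # Independent uniform noise for every pixel of every frame.
    frames = rng.random((num_frames, *mask.shape))
    # One frozen noise image; its values are reused inside the mask.
    frozen = rng.random(mask.shape)
    frames[:, mask] = frozen[mask]
    return frames


def recover_mask(frames):
    """Recover the hidden shape from temporal differences alone."""
    # A pixel belongs to the foreground if it never changes between frames.
    max_change = np.abs(np.diff(frames, axis=0)).max(axis=0)
    return max_change == 0


# Hypothetical hidden shape: a filled rectangle on a 32x32 grid.
mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 12:20] = True

frames = make_temporal_noise_clip(mask)
recovered = recover_mask(frames)
print("shape recovered from temporal cue:", np.array_equal(recovered, mask))
```

A model that summarizes each frame independently has nothing to work with here, because each frame is statistically identical noise; only a comparison across frames exposes the rectangle.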
